
[None][fix] kvcache storeContextBlocks toctou #12120

Open
thorjohnsen wants to merge 9 commits into NVIDIA:main from thorjohnsen:fix/kvcache-store-context-blocks-toctou

Conversation

Collaborator

@thorjohnsen thorjohnsen commented Mar 11, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced CUDA Virtual Memory Manager with dynamic memory reservation and on-demand page allocation capabilities.
    • Added example demonstrating virtual memory arena usage patterns.
  • Bug Fixes

    • Fixed thread-safety issue in KV Cache Manager to prevent race conditions during sequence operations.
  • Tests

    • Added comprehensive test suite for CUDA Virtual Memory Manager functionality.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

thorjohnsen and others added 9 commits March 10, 2026 10:35
Introduces a class that manages a contiguous virtual address range
backed by physical CUDA memory pages that can be grown and shrunk
at runtime using the CUDA Virtual Memory Management API
(cuMemAddressReserve / cuMemCreate / cuMemMap / cuMemUnmap).

Also adds Google Test unit tests and a usage example.
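The arena commits physical memory in granularity-sized chunks, so every requested size is first rounded up to the allocation granularity. A minimal sketch of that arithmetic (`align_up` and the 2 MiB granularity are illustrative assumptions, not the PR's actual code):

```cpp
#include <cassert>
#include <cstddef>

// Round size up to the next multiple of granularity (granularity must be > 0).
constexpr std::size_t align_up(std::size_t size, std::size_t granularity) {
    return ((size + granularity - 1) / granularity) * granularity;
}

// With a typical 2 MiB granularity (as reported by
// cuMemGetAllocationGranularity on many GPUs), a 5 MiB grow() request
// rounds to 6 MiB, i.e. three 2 MiB chunks to create and map.
```

For example, `align_up(5 MiB, 2 MiB)` yields 6 MiB, so a grow to 5 MiB would map three chunks.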

Files:
  cpp/include/tensorrt_llm/batch_manager/cuda_vmm_arena.h
  cpp/tensorrt_llm/batch_manager/cuda_vmm_arena.cpp
  cpp/tensorrt_llm/batch_manager/test_cuda_vmm_arena.cu
  cpp/tensorrt_llm/batch_manager/example.cu

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
Add SPDX copyright and Apache 2.0 license block to all four
CudaVmmArena source files per NVIDIA project conventions.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
Replace SPDX-tagged header with the canonical NVIDIA copyright
and Apache 2.0 license block.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
Use TRTLLM_CUDA_VMM_ARENA_H guard per project naming convention.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
Use single-space indent (` *`) for the Apache 2.0 license block in all
four cuda_vmm_arena source files, matching the NVIDIA copyright header
convention used elsewhere in the repo.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
Relocate the Google Test suite for CudaVmmArena from the implementation
directory to the canonical test location under cpp/tests/unit_tests/.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
Rename source files from snake_case to camelCase:
  cuda_vmm_arena.h        -> cudaVmmArena.h
  cuda_vmm_arena.cpp      -> cudaVmmArena.cpp
  test_cuda_vmm_arena.cu  -> testCudaVmmArena.cu

Update include guards and #include directives accordingly.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
…keLists

Add add_gtest entry for testCudaVmmArena.cu so the CudaVmmArena Google
Test suite is picked up by the CMake build system.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
The previous implementation performed a two-step check-then-act pattern
across two separate acquisitions of mSequencesMtx:

  1. Lock acquired, mSequences.find() result stored in `found`, lock released.
  2. If found, getSequence() re-acquires the lock and calls mSequences.at().

removeSequence() uses mSequences.extract() under the lock, so it can run
between steps 1 and 2, removing the sequence node. The subsequent .at()
call in getSequence() then throws std::out_of_range. Moreover, getSequence()
releases the lock before returning the reference, leaving the reference to
the GenerationRequest unprotected across the mBlockManager.storeContextBlocks()
call that follows.

Fix: hold mSequencesMtx for the entire find-and-use operation, bypassing the
redundant getSequence() call. mBlockManager.storeContextBlocks() does not
acquire mSequencesMtx, so no deadlock is introduced. removeSequence() will
simply block until storeContextBlocks() completes, which is the correct
serialization order.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
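The locking pattern described above can be sketched with standard-library types (`KvCacheManagerSketch`, the `int` keys, and `numBlocks` are stand-ins for the real TRT-LLM members, not the actual kvCacheManager.cpp code):

```cpp
#include <cassert>
#include <mutex>
#include <unordered_map>

struct GenerationRequest { int numBlocks = 0; };

class KvCacheManagerSketch {
public:
    void addSequence(int id) {
        std::lock_guard<std::mutex> lk(mSequencesMtx);
        mSequences.emplace(id, GenerationRequest{});
    }

    // Fixed pattern: one lock acquisition covers find *and* use, so
    // removeSequence() cannot run between the existence check and the access.
    bool storeContextBlocks(int id) {
        std::lock_guard<std::mutex> lk(mSequencesMtx); // held for the whole op
        auto it = mSequences.find(id);
        if (it == mSequences.end())
            return false;              // sequence already removed; no throw
        it->second.numBlocks += 1;     // safe: lock still held across the use
        return true;
    }

    void removeSequence(int id) {
        std::lock_guard<std::mutex> lk(mSequencesMtx); // blocks until store is done
        mSequences.erase(id);
    }

private:
    std::mutex mSequencesMtx;
    std::unordered_map<int, GenerationRequest> mSequences;
};
```

The racy original instead released the lock after `find()` and re-acquired it inside `getSequence()`, leaving a window for `extract()` to remove the node in between.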
@thorjohnsen thorjohnsen requested a review from a team as a code owner March 11, 2026 19:30
Contributor

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

Introduces a CUDA Virtual Memory Manager arena with virtual address reservation and on-demand physical page mapping via CUDA driver APIs, along with a usage example and a comprehensive test suite, plus a thread-safety fix to kvCacheManager's sequence lookup.

Changes

Cohort / File(s) Summary
CUDA VMM Arena Core
cpp/include/tensorrt_llm/batch_manager/cudaVmmArena.h, cpp/tensorrt_llm/batch_manager/cudaVmmArena.cpp
Implements CudaVmmError exception and CudaVmmArena class managing fixed VA window reservations with on-demand physical page mapping, including grow/shrink/resize lifecycle methods, alignment utilities, and granularity-based chunk management backed by CUDA driver APIs (cuMemAddressReserve, cuMemCreate, cuMemMap, cuMemSetAccess).
CUDA VMM Arena Example & Tests
cpp/tensorrt_llm/batch_manager/example.cu, cpp/tests/unit_tests/batch_manager/testCudaVmmArena.cu, cpp/tests/unit_tests/batch_manager/CMakeLists.txt
Demonstrates arena usage flow (grow, shrink, resize phases with device kernel validation) and provides comprehensive Google Test suite covering construction, alignment, lifecycle operations, pointer stability, memory access verification, and multi-arena independence; registers new test target.
KVCache Thread-Safety Fix
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
Adds extended lock scope around sequence lookup in storeContextBlocks to prevent TOCTOU race between existence check and subsequent access; replaces check-then-get pattern with single iterator lookup under lock.

Sequence Diagram(s)

sequenceDiagram
    actor App as Application
    participant Arena as CudaVmmArena
    participant CudaAPI as CUDA Driver
    participant Mem as Device Memory

    App->>Arena: CudaVmmArena(max_size, device)
    Arena->>CudaAPI: cuMemAddressReserve(VA window)
    Arena->>Arena: Store base_ptr, granularity

    App->>Arena: grow(new_size)
    Arena->>CudaAPI: cuMemCreate(physical pages)
    Note over Arena,CudaAPI: Allocate granularity-aligned chunks
    Arena->>CudaAPI: cuMemMap(VA window, physical pages)
    Arena->>CudaAPI: cuMemSetAccess(grant R/W access)
    Arena->>Arena: Update committed_size, store handles

    App->>Mem: Launch kernel write to arena VA
    Mem->>Mem: Access mapped physical memory

    App->>Arena: shrink(new_size)
    Arena->>CudaAPI: cuMemUnmap(VA → chunks)
    Arena->>CudaAPI: cuMemRelease(physical pages)
    Arena->>Arena: Update committed_size, remove handles

    App->>Arena: ~CudaVmmArena()
    Arena->>CudaAPI: Unmap & release all chunks
    Arena->>CudaAPI: cuMemAddressFree(VA window)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 42.86%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Description check — ⚠️ Warning: The PR description contains only the template, with no content for the Description, Test Coverage, or checklist sections. Resolution: describe the actual changes, explain the rationale for the CudaVmmArena introduction and the TOCTOU fix in KVCacheManager, list the test coverage (testCudaVmmArena.cu, example.cu), and confirm the checklist items.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed: The title correctly identifies a TOCTOU race-condition fix in KVCacheManager::storeContextBlocks, matching the primary change in kvCacheManager.cpp.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/include/tensorrt_llm/batch_manager/cudaVmmArena.h`:
- Around line 1-18: The header prologue and include guard in cudaVmmArena.h do
not follow repo conventions; replace the current file header with the standard
TensorRT-LLM Apache/SPDX prologue including the correct NVIDIA copyright line
with the latest modification year, and change the include guard macro from
TRTLLM_CUDAVMMARENA_H to the required format TRTLLM_CUDA_VMM_ARENA_H so the
guard matches TRTLLM_<FILENAME_IN_CAPS>_H and aligns with project guidelines.

In `@cpp/tensorrt_llm/batch_manager/cudaVmmArena.cpp`:
- Around line 151-168: CudaVmmArena::grow currently leaves handles_ and
committed_size_ out of sync if map_chunk throws partway; modify grow to maintain
invariants by either advancing committed_size_ in lockstep after each successful
map_chunk call (set committed_size_ = offset + granularity_ or similar inside
the loop) or wrap the map loop in a try/catch that on exception iterates over
the newly-added handles_ for this invocation to unmap/close them and restore
committed_size_ to its pre-grow value before rethrowing; refer to
CudaVmmArena::grow, map_chunk, committed_size_, and handles_ when making the
fix.
- Around line 133-145: unmap_chunk currently ignores failures from
cuMemSetAccess, cuMemUnmap, and cuMemRelease, causing shrink() to pop handles_
and decrement committed_size_ even if unmapping/releasing fails; update
unmap_chunk (and the shrink path that calls it) to check return codes from
cuMemSetAccess, cuMemUnmap, and cuMemRelease, and on any failure: stop mutating
state (do not clear handles_[chunk_idx] or decrement committed_size_), propagate
or return the error to the caller so shrink can abort and preserve the tail
chunk, and only clear handles_[chunk_idx] and adjust committed_size_ after all
CUDA calls succeed; use the existing symbols (unmap_chunk, shrink,
cuMemSetAccess, cuMemUnmap, cuMemRelease, handles_, committed_size_, base_ptr_,
granularity_) to locate and implement the checks and error propagation.

In `@cpp/tests/unit_tests/batch_manager/CMakeLists.txt`:
- Line 30: The CMake target add_gtest for testCudaVmmArena currently links the
default gtest_main and omits the CUDA driver library which causes a duplicate
main symbol and missing CUDA driver symbols; update the add_gtest invocation for
the testCudaVmmArena target to pass the NO_GTEST_MAIN flag so it does not link
gtest_main (since testCudaVmmArena.cu defines its own main), and also link the
CUDA driver library by adding target_link_libraries(testCudaVmmArena PRIVATE
CUDA::cuda_driver) so calls like
cuInit/cuDeviceGet/cuCtxCreate/cuMemGetAllocationGranularity resolve.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa963e46-385c-4a50-a197-bc7bee8bf02a

📥 Commits

Reviewing files that changed from the base of the PR and between 0350b7f and 8c28d80.

📒 Files selected for processing (6)
  • cpp/include/tensorrt_llm/batch_manager/cudaVmmArena.h
  • cpp/tensorrt_llm/batch_manager/cudaVmmArena.cpp
  • cpp/tensorrt_llm/batch_manager/example.cu
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/CMakeLists.txt
  • cpp/tests/unit_tests/batch_manager/testCudaVmmArena.cu

Comment on lines +1 to +18
/*
 * Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#ifndef TRTLLM_CUDAVMMARENA_H
#define TRTLLM_CUDAVMMARENA_H

⚠️ Potential issue | 🟡 Minor

Align the new header prologue with the repo conventions.

The new file does not use the standard TensorRT-LLM Apache/SPDX prologue, and the guard should follow TRTLLM_<FILENAME_IN_CAPS>_H—for this file, TRTLLM_CUDA_VMM_ARENA_H.

As per coding guidelines, "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of the latest meaningful modification. The header should be an Apache 2.0 license block as specified" and "Use a preprocessor guard in C++ header files with the format TRTLLM_<FILENAME_IN_CAPS>_H."

Also applies to: 145-145

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/tensorrt_llm/batch_manager/cudaVmmArena.h` around lines 1 - 18,
The header prologue and include guard in cudaVmmArena.h do not follow repo
conventions; replace the current file header with the standard TensorRT-LLM
Apache/SPDX prologue including the correct NVIDIA copyright line with the latest
modification year, and change the include guard macro from TRTLLM_CUDAVMMARENA_H
to the required format TRTLLM_CUDA_VMM_ARENA_H so the guard matches
TRTLLM_<FILENAME_IN_CAPS>_H and aligns with project guidelines.

Comment on lines +133 to +145
void CudaVmmArena::unmap_chunk(size_t chunk_idx) {
    const size_t offset = chunk_idx * granularity_;

    // Revoke access before unmapping (required by the CUDA VMM spec).
    CUmemAccessDesc no_access{};
    no_access.location = alloc_prop_.location;
    no_access.flags = CU_MEM_ACCESS_FLAGS_PROT_NONE;
    cuMemSetAccess(base_ptr_ + offset, granularity_, &no_access, 1);

    cuMemUnmap(base_ptr_ + offset, granularity_);
    cuMemRelease(handles_[chunk_idx]);
    handles_[chunk_idx] = CUmemGenericAllocationHandle{};
}

⚠️ Potential issue | 🟠 Major

Don't hide CUDA VMM failures on the shrink path.

unmap_chunk() ignores the results of cuMemSetAccess, cuMemUnmap, and cuMemRelease, yet shrink() pops the handle and decrements committed_size_ as if the tail chunk was fully released. A checked path is needed here; otherwise the arena can report success while leaking or leaving pages mapped.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/batch_manager/cudaVmmArena.cpp` around lines 133 - 145,
unmap_chunk currently ignores failures from cuMemSetAccess, cuMemUnmap, and
cuMemRelease, causing shrink() to pop handles_ and decrement committed_size_
even if unmapping/releasing fails; update unmap_chunk (and the shrink path that
calls it) to check return codes from cuMemSetAccess, cuMemUnmap, and
cuMemRelease, and on any failure: stop mutating state (do not clear
handles_[chunk_idx] or decrement committed_size_), propagate or return the error
to the caller so shrink can abort and preserve the tail chunk, and only clear
handles_[chunk_idx] and adjust committed_size_ after all CUDA calls succeed; use
the existing symbols (unmap_chunk, shrink, cuMemSetAccess, cuMemUnmap,
cuMemRelease, handles_, committed_size_, base_ptr_, granularity_) to locate and
implement the checks and error propagation.
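The contract the reviewer asks for — check every driver call and mutate arena state only after all calls succeed — can be sketched GPU-free, with a fake status-returning call standing in for cuMemSetAccess/cuMemUnmap/cuMemRelease (`ShrinkSketch` and `fakeUnmap` are hypothetical stand-ins; real code would compare return codes against CUDA_SUCCESS):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a CUDA driver status code (0 == success).
using FakeResult = int;

class ShrinkSketch {
public:
    explicit ShrinkSketch(std::size_t chunks, bool failUnmap = false)
        : handles_(chunks, 1), committed_chunks_(chunks), failUnmap_(failUnmap) {}

    // Returns false on driver failure *without* touching arena state,
    // mirroring the suggested unmap_chunk contract.
    bool unmap_chunk(std::size_t idx) {
        if (fakeUnmap(idx) != 0)  // real code: cuMemSetAccess/cuMemUnmap/cuMemRelease
            return false;         // do NOT clear handles_[idx] or shrink the size
        handles_[idx] = 0;        // mutate only after every call succeeded
        return true;
    }

    bool shrink(std::size_t new_chunks) {
        while (committed_chunks_ > new_chunks) {
            if (!unmap_chunk(committed_chunks_ - 1))
                return false;     // abort: tail chunk preserved, state consistent
            --committed_chunks_;
        }
        return true;
    }

    std::size_t committedChunks() const { return committed_chunks_; }
    bool chunkLive(std::size_t idx) const { return handles_[idx] != 0; }

private:
    FakeResult fakeUnmap(std::size_t) const { return failUnmap_ ? 1 : 0; }

    std::vector<int> handles_;
    std::size_t committed_chunks_;
    bool failUnmap_;
};
```

On failure the tail chunk stays live and `committed_chunks_` is untouched, so a later retry or teardown still sees accurate state.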

Comment on lines +151 to +168
void CudaVmmArena::grow(size_t new_size) {
    const size_t aligned = align_up(new_size, granularity_);

    if (aligned == 0)
        throw CudaVmmError("grow(): new_size rounds to zero.");
    if (aligned > max_size_)
        throw CudaVmmError("grow(): new_size exceeds the reserved VA range.");
    if (aligned <= committed_size_)
        throw CudaVmmError("grow(): new_size must be larger than current committed_size.");

    // Map chunks covering [committed_size_, aligned).
    size_t offset = committed_size_;
    while (offset < aligned) {
        map_chunk(offset); // may throw; already-mapped chunks stay valid
        offset += granularity_;
    }

    committed_size_ = aligned;

⚠️ Potential issue | 🟠 Major

grow() leaves the arena state inconsistent after a mid-grow failure.

If one map_chunk() succeeds and a later one throws, handles_ already contains extra mappings but committed_size_ is still the old value. The next grow() or shrink() call will then reason from stale state. Roll back the chunks mapped in this call, or advance committed_size_ in lockstep with each successful mapping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/batch_manager/cudaVmmArena.cpp` around lines 151 - 168,
CudaVmmArena::grow currently leaves handles_ and committed_size_ out of sync if
map_chunk throws partway; modify grow to maintain invariants by either advancing
committed_size_ in lockstep after each successful map_chunk call (set
committed_size_ = offset + granularity_ or similar inside the loop) or wrap the
map loop in a try/catch that on exception iterates over the newly-added handles_
for this invocation to unmap/close them and restore committed_size_ to its
pre-grow value before rethrowing; refer to CudaVmmArena::grow, map_chunk,
committed_size_, and handles_ when making the fix.
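The rollback option suggested above can be sketched GPU-free, with a fallible map step standing in for map_chunk (`ArenaSketch` and the simulated failure are hypothetical; a real fix would cuMemUnmap/cuMemRelease the freshly mapped chunks in the catch block):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Stand-in arena: map_chunk() may throw partway through a grow().
class ArenaSketch {
public:
    explicit ArenaSketch(std::size_t granularity, std::size_t failAtChunk = SIZE_MAX)
        : granularity_(granularity), failAtChunk_(failAtChunk) {}

    void grow(std::size_t new_size) {
        std::size_t const aligned = align_up(new_size);
        std::size_t const old_committed = committed_size_;
        std::size_t offset = committed_size_;
        try {
            while (offset < aligned) {
                map_chunk(offset);          // may throw
                offset += granularity_;
            }
        } catch (...) {
            // Roll back only the chunks mapped by *this* call, then rethrow,
            // so handles_ and committed_size_ stay consistent.
            while (handles_.size() > old_committed / granularity_) {
                unmap_chunk(handles_.size() - 1);
                handles_.pop_back();
            }
            throw;
        }
        committed_size_ = aligned;
    }

    std::size_t committed() const { return committed_size_; }
    std::size_t mappedChunks() const { return handles_.size(); }

private:
    std::size_t align_up(std::size_t s) const {
        return ((s + granularity_ - 1) / granularity_) * granularity_;
    }
    void map_chunk(std::size_t offset) {
        if (offset / granularity_ == failAtChunk_)
            throw std::runtime_error("simulated cuMemCreate failure");
        handles_.push_back(offset / granularity_); // real code: cuMemCreate + cuMemMap
    }
    void unmap_chunk(std::size_t idx) { (void)idx; } // real code: cuMemUnmap + cuMemRelease

    std::size_t granularity_;
    std::size_t failAtChunk_;
    std::size_t committed_size_ = 0;
    std::vector<std::size_t> handles_;
};
```

The alternative the comment mentions — advancing committed_size_ in lockstep inside the loop — avoids the catch block but leaves a partially grown arena visible to the caller after a failure.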

add_gtest(peftCacheManagerTest peftCacheManagerTest.cpp)
add_gtest(staticThreadPoolTest staticThreadPoolTest.cpp)
add_gtest(rnnCacheFormatterTest rnnCacheFormatterTest.cpp)
add_gtest(testCudaVmmArena testCudaVmmArena.cu)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== add_gtest implementation =="
python - <<'PY'
import pathlib, re
pattern = re.compile(r'(?ims)^\s*(function|macro)\s*\(\s*add_gtest\b.*?^\s*end(?:function|macro)\s*\)')
for path in pathlib.Path('.').rglob('*'):
    if path.is_file() and (path.name == 'CMakeLists.txt' or path.suffix == '.cmake'):
        text = path.read_text(errors='ignore')
        for match in pattern.finditer(text):
            print(f'--- {path} ---')
            print(match.group(0))
PY

echo
echo "== target registration and custom main =="
rg -n -C2 'add_gtest\s*\(\s*testCudaVmmArena\b|^\s*int\s+main\s*\(' -g 'CMakeLists.txt' -g '*.cmake' -g '*testCudaVmmArena.cu'

echo
echo "== possible gtest_main / CUDA driver linkage =="
rg -n -C2 'gtest_main|CUDA::cuda_driver|cuda_driver|[/-]lcuda' -g 'CMakeLists.txt' -g '*.cmake'

Repository: NVIDIA/TensorRT-LLM

Length of output: 5745


🏁 Script executed:

# Search for add_gtest definition in cmake files
fd -e cmake | head -20 | xargs grep -l "add_gtest" 2>/dev/null

Repository: NVIDIA/TensorRT-LLM

Length of output: 47


🏁 Script executed:

# Search for add_gtest definition more broadly
rg -n "^\s*(function|macro)\s*\(\s*add_gtest" --type cmake

Repository: NVIDIA/TensorRT-LLM

Length of output: 228


🏁 Script executed:

# Check the batch_manager CMakeLists.txt more thoroughly
cat -n cpp/tests/unit_tests/batch_manager/CMakeLists.txt | head -60

Repository: NVIDIA/TensorRT-LLM

Length of output: 1748


🏁 Script executed:

# Check if testCudaVmmArena.cu actually uses CUDA driver API
rg -n "CUDA|driver|cuCtx|cuMem|cuLaunch" cpp/tests/unit_tests/batch_manager/testCudaVmmArena.cu | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 746


🏁 Script executed:

sed -n '35,80p' cpp/tests/CMakeLists.txt

Repository: NVIDIA/TensorRT-LLM

Length of output: 1488


🏁 Script executed:

# Check if batch_manager has any additional linkage setup
rg -n "target_link_libraries|batch_manager" cpp/tests/unit_tests/batch_manager/CMakeLists.txt

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Check if batch_manager CMakeLists has the full context - look for any target setup after line 30
wc -l cpp/tests/unit_tests/batch_manager/CMakeLists.txt

Repository: NVIDIA/TensorRT-LLM

Length of output: 115


🏁 Script executed:

# Get the full batch_manager CMakeLists.txt
cat cpp/tests/unit_tests/batch_manager/CMakeLists.txt

Repository: NVIDIA/TensorRT-LLM

Length of output: 1538


Use NO_GTEST_MAIN flag and link CUDA driver library for this test.

testCudaVmmArena.cu defines its own main() and uses CUDA driver API (cuInit, cuDeviceGet, cuCtxCreate, cuMemGetAllocationGranularity). Without the NO_GTEST_MAIN flag, the default add_gtest will link gtest_main, causing a linker error (duplicate main symbol). Additionally, the CUDA driver library (CUDA::cuda_driver) must be explicitly linked:

add_gtest(testCudaVmmArena testCudaVmmArena.cu NO_GTEST_MAIN)
target_link_libraries(testCudaVmmArena PRIVATE CUDA::cuda_driver)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/unit_tests/batch_manager/CMakeLists.txt` at line 30, The CMake
target add_gtest for testCudaVmmArena currently links the default gtest_main and
omits the CUDA driver library which causes a duplicate main symbol and missing
CUDA driver symbols; update the add_gtest invocation for the testCudaVmmArena
target to pass the NO_GTEST_MAIN flag so it does not link gtest_main (since
testCudaVmmArena.cu defines its own main), and also link the CUDA driver library
by adding target_link_libraries(testCudaVmmArena PRIVATE CUDA::cuda_driver) so
calls like cuInit/cuDeviceGet/cuCtxCreate/cuMemGetAllocationGranularity resolve.

@schetlur-nv schetlur-nv requested review from schetlur-nv and removed request for schetlur-nv March 11, 2026 22:44