[sync] upstream llm-d main branch 1c435cb [2026-04-02]#16

Open
zdtsw wants to merge 6 commits into opendatahub-io:main_2 from zdtsw-forking:sync/upstream-1c435cb

Conversation


@zdtsw zdtsw commented Apr 2, 2026

Syncs llm-d/llm-d-kv-cache main branch into ODH main_2 branch.

Upstream commit: llm-d@1c435cb

This maps to https://github.com/llm-d/llm-d-kv-cache/releases/tag/v0.7.0

Summary by CodeRabbit

Release Notes

  • New Features

    • Added GPUDirect Storage (GDS) support for GPU-accelerated file I/O with configurable read/write and bounce-buffer modes.
  • Improvements

    • Enhanced vLLM event decoding for improved compatibility.
    • Improved tokenizer timeout handling for multimodal content requests.
  • Documentation

    • Added comprehensive GDS setup guides and configuration examples.
  • Tests

    • Added GDS backend tests and vLLM event decoding benchmarks.

vMaroon and others added 6 commits March 30, 2026 10:01
Multimodal requests need to download and process images, which can
exceed the 5s default timeout in CI. Use 30s for requests with
structured content parts.
* feat: add GPUDirect Storage (GDS) support to llmd_fs_backend

Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
Granite's built-in chat template injects the current date via
strftime_now(), causing goldenChatTokenIDs to change daily.
Add an explicit system message to the golden chat conversation
so the template uses the provided content instead of generating
a date-dependent default system prompt.
* fix: single-pass []any decode for forward/backward compat with vLLM event schema (llm-d#484)

vLLM uses msgspec with array_like=True and omit_defaults=True, producing
positional msgpack arrays where trailing fields may be absent. The previous
typed-struct decode broke when vLLM appended new fields (old consumer fails)
or when a newer consumer read from an older vLLM (shorter array than expected).

Replace double-decode ([]any for tag + typed struct) with a single unmarshal
into []any and positional extraction with length guards. Extra trailing fields
from newer vLLM are silently ignored; missing trailing fields from older vLLM
get zero values.
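The length-guard pattern described above can be sketched as follows. The field layout here (tag, block size, optional medium) is a hypothetical positional record for illustration, not the actual vLLM event schema:

```go
package main

import "fmt"

// event is a hypothetical positional record; the real vLLM event schema
// and field order differ.
type event struct {
	Tag       string
	BlockSize int
	Medium    string // trailing field, absent when the producer is older
}

// decodeEvent extracts fields positionally from a []any, the shape a
// msgpack array-decode produces. Length guards make it tolerant in both
// directions: extra trailing fields from a newer producer are ignored,
// missing trailing fields from an older producer get zero values.
func decodeEvent(arr []any) (event, error) {
	var e event
	if len(arr) < 2 {
		return e, fmt.Errorf("event array too short: %d", len(arr))
	}
	tag, ok := arr[0].(string)
	if !ok {
		return e, fmt.Errorf("field 0: want string, got %T", arr[0])
	}
	e.Tag = tag
	size, ok := arr[1].(int64)
	if !ok {
		return e, fmt.Errorf("field 1: want int64, got %T", arr[1])
	}
	e.BlockSize = int(size)
	if len(arr) > 2 { // present only when the producer is new enough
		if m, ok := arr[2].(string); ok {
			e.Medium = m
		}
	}
	return e, nil
}

func main() {
	old, _ := decodeEvent([]any{"block_stored", int64(16)})           // older producer
	cur, _ := decodeEvent([]any{"block_stored", int64(16), "gpu", 7}) // newer producer
	fmt.Println(old.Medium, cur.Medium)
}
```

A single unmarshal into []any plus positional extraction like this avoids the double-decode the commit removed, at the cost of per-field type assertions.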

* test: add decode benchmarks for vLLM event schema compat

* fix: address review — reuse shared helpers, fix lint

- Reuse convertBlockHashes() from common.go in both vLLM converters
- Move convertExtraKeys() back to common.go (shared with SGLang)
- Remove unused engineName param from decodeEvent()
- Fix bench test lint: error checks, paramTypeCombine, appendCombine

* fix: suppress gosec G115 in bench test data construction

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough

Walkthrough

This PR introduces GPUDirect Storage (GDS) support to enable direct GPU-to-file I/O via NVIDIA cuFile. Changes include: (1) dynamic runtime loading of libcufile.so via dlopen with fallback symbol resolution; (2) a new StorageHandler abstract interface with FileIO (CPU-staged) and GdsFileIO (GDS-backed) implementations; (3) mode-dependent storage dispatch in StorageOffloadEngine selecting between CPU and GDS paths; (4) Python binding updates to accept a gds_mode configuration parameter; (5) kvevents adapter refactoring to support msgspec positional decoding for vLLM events; (6) supporting infrastructure including enums (StorageMode, GdsMode), documentation, and tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Security & Design Findings

CWE-427 (Untrusted Search Path): cufile_loader.hpp uses dlopen("libcufile.so") and dlopen("libcufile.so.0") without absolute paths or integrity validation. No constraints on LD_LIBRARY_PATH or library version. Recommend: validate library signature or restrict search to /usr/lib64 or require absolute path via environment variable.

CWE-252 (Unchecked Return Value): dlsym() calls in cufile_loader.hpp resolve function pointers but leave them nullptr on failure without distinguishing partial load failures from total failure. Callers cannot reliably detect which functions are unavailable. Consider explicit per-function error logging at load time.

CWE-367 (Time-of-Check-Time-of-Use): gds_file_io.cpp writes to temporary file then renames atomically (good), but no validation that parent directory creation succeeds before write. std::filesystem::create_directories errors are not explicitly checked.

Design Issue: StorageOffloadEngine constructor now requires gds_mode string that is parsed in spec.py but validated only in worker.py after construction. Prefer early validation at binding time or factory method. String-based mode dispatch in parse_gds_mode() has no exhaustive pattern matching—falling back to DISABLED silently masks typos.
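A strict parser along the lines suggested might look like this. The mode names other than 'disabled' are assumptions here; the real values live in GdsMode in storage_types.hpp:

```python
from enum import Enum


class GdsMode(Enum):
    # Assumed mode names for illustration; match these to storage_types.hpp.
    DISABLED = "disabled"
    READ = "read"
    WRITE = "write"
    READ_WRITE = "read_write"


def parse_gds_mode(value: str) -> GdsMode:
    """Strict parse: unknown strings raise instead of silently falling
    back to DISABLED, so a typo surfaces at configuration time."""
    try:
        return GdsMode(value.strip().lower())
    except ValueError:
        valid = ", ".join(m.value for m in GdsMode)
        raise ValueError(
            f"invalid gds_mode {value!r}; expected one of: {valid}"
        ) from None
```

Raising at parse time (in spec.py or the binding layer) moves the failure from a silent runtime downgrade to an explicit configuration error.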

Architectural Concern: TensorCopier now exposes accessors (get_tensors(), get_block_size()) to enable GdsFileIO initialization, breaking encapsulation. Consider passing these via constructor parameters instead of exposing internal state.

🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check (❓ Inconclusive): The title references a specific upstream commit hash (1c435cb) and date (2026-04-02), indicating a sync operation, but it does not clearly convey the main substantive changes. The PR includes GDS support, vLLM event schema changes, timeout tuning, and golden-test fixes, yet the title hides these behind a generic sync pattern. Resolution: consider expanding the title to highlight the primary feature addition (e.g., '[sync] Add GDS support and vLLM event compatibility updates from upstream') or clarifying the scope beyond the commit hash reference.
✅ Passed checks (1 passed)
  • Description check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp (1)

131-142: ⚠️ Potential issue | 🟠 Major

gds_mode parameter is unused; staging buffer allocated even for full-GDS modes.

The function signature accepts GdsMode gds_mode but never references it. Per the header comment ("0 for full-GDS modes"), this should return 0 when gds_mode indicates direct GDS transfers, avoiding unnecessary memory allocation.

Proposed fix
 size_t StorageOffloadEngine::calc_staging_bytes(
     int gpu_blocks_per_file,
     const std::vector<torch::Tensor>& tensors,
     GdsMode gds_mode) {
+  // Full-GDS modes don't need CPU staging buffers
+  if (gds_mode == GdsMode::READ_WRITE) {
+    return 0;
+  }
   size_t block_size_in_bytes = 0;
   for (const auto& tensor : tensors) {
     block_size_in_bytes += static_cast<size_t>(tensor.stride(0)) *
                            static_cast<size_t>(tensor.element_size());
   }
   return block_size_in_bytes * static_cast<size_t>(gpu_blocks_per_file);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp` around lines
131 - 142, The calc_staging_bytes function currently ignores the gds_mode
parameter and always computes a staging buffer size; update
StorageOffloadEngine::calc_staging_bytes to return 0 when gds_mode indicates
full-GDS/direct GDS transfer mode (per the header note "0 for full-GDS modes")
and only compute the byte-size loop for non-GDS modes; locate the function by
name and add a conditional branch that checks the appropriate GdsMode enum/flag
for full-GDS before performing the tensor stride*element_size accumulation so
staging memory is not allocated for direct-GDS transfers.
🧹 Nitpick comments (12)
pkg/tokenization/uds_tokenizer.go (2)

76-78: Reuse mmTimeout in warmup() for consistency.

Line 200 hardcodes 30*time.Second for the warmup context timeout. Consider using mmTimeout to keep the value in sync.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tokenization/uds_tokenizer.go` around lines 76 - 78, The warmup()
function currently hardcodes a 30*time.Second context timeout; replace that
literal with the mmTimeout constant to keep timeout values consistent. Locate
the context creation in warmup() (e.g., context.WithTimeout(...,
30*time.Second)) and change the timeout argument to mmTimeout so the function
uses the shared mmTimeout variable defined near the top of the file.

284-291: Parent context deadline ignored; gRPC call may outlive caller's timeout.

context.Background() is hardcoded, so if upstream callers (e.g., HTTP handlers, orchestrators) have their own deadlines, this 30s timeout will not respect them. A slow backend could hold resources long after the original request is abandoned.

Consider accepting a parent context.Context parameter and deriving the timeout from it:

ctx, cancel := context.WithTimeout(parentCtx, timeout)

This requires a signature change to RenderChat (and the Tokenizer interface), which may be out of scope for this PR. If so, document this limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tokenization/uds_tokenizer.go` around lines 284 - 291, The RenderChat
call is creating a new context from context.Background(), ignoring upstream
deadlines; change the API to accept a parent context.Context and derive the
timeout from it (use ctx, cancel := context.WithTimeout(parentCtx, timeout)),
update the Tokenizer interface signature (Tokenizer.RenderChat) and all callers
to pass through the caller's context, and ensure cancel() is deferred; if
changing signatures is out of scope, add a clear code comment on RenderChat and
the Tokenizer interface describing this limitation and the potential resource
leak so it is documented for future work.
pkg/kvevents/engineadapter/vllm_adapter.go (1)

293-316: Missing int case in type switch.

toInt handles int8-int64 and uint8-uint64 but not the base int type. While msgpack v5 typically decodes integers as int64/uint64, if any code path passes a Go int, this will fail with a confusing "unsupported numeric type: int" error.

Add int case for completeness
 func toInt(raw any) (int, error) {
 	switch v := raw.(type) {
+	case int:
+		return v, nil
 	case int64:
 		return int(v), nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/kvevents/engineadapter/vllm_adapter.go` around lines 293 - 316, The toInt
function is missing a case for the native Go int type which causes "unsupported
numeric type: int" errors; update the type switch in toInt (function toInt) to
include a case for int that returns int(v), nil (you can add a comment similar
to the uint64 case about overflow not being a concern) so the function handles
int values alongside int8/int16/int32/int64 and the unsigned variants.
kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp (3)

144-149: utimensat return value ignored—silent failure on atime update.

If utimensat fails (e.g., permissions, non-existent file), the error is silently discarded. Consider logging at DEBUG level for troubleshooting.

Proposed fix
 void FileIO::update_atime(const std::string& path) {
   struct timespec times[2];
   times[0].tv_nsec = UTIME_NOW;   // atime → now
   times[1].tv_nsec = UTIME_OMIT;  // mtime → unchanged
-  utimensat(AT_FDCWD, path.c_str(), times, 0);
+  if (utimensat(AT_FDCWD, path.c_str(), times, 0) != 0) {
+    FS_LOG_DEBUG("Failed to update atime for: " << path << " - "
+                                                 << std::strerror(errno));
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp` around
lines 144 - 149, FileIO::update_atime currently calls utimensat and ignores its
return value; change it to check the return value and, on error (return < 0),
log a DEBUG-level message including the path and strerror(errno) (or errno) to
aid troubleshooting; ensure you include context (function
name/FileIO::update_atime and the path) in the log and keep behavior otherwise
unchanged (no exception thrown) so callers remain unaffected.

66-74: pubsetbuf after opening has implementation-defined behavior.

Per C++ standard, calling pubsetbuf after a file is already open may be ignored by some implementations. Move the buffer setup before open() or use the two-argument std::ofstream constructor after setting the buffer.

Proposed fix
-  std::ofstream ofs(tmp_path, std::ios::out | std::ios::binary);
-  if (!ofs) {
+  std::ofstream ofs;
+  ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE);
+  ofs.open(tmp_path, std::ios::out | std::ios::binary);
+  if (!ofs.is_open()) {
     FS_LOG_ERROR("Failed to open temporary file for writing: "
                  << tmp_path << " - " << std::strerror(errno));
     return false;
   }
-
-  // Apply the custom buffer to the file stream
-  ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp` around
lines 66 - 74, The call to ofs.rdbuf()->pubsetbuf(...) occurs after the
std::ofstream is opened which is implementation-defined and may be ignored;
change to set the buffer on the streambuf before opening the file (use a
default-constructed std::ofstream ofs, call
ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE), then call
ofs.open(tmp_path, std::ios::out | std::ios::binary) and check the stream), or
alternatively use a constructor/approach that ensures the buffer is installed
prior to opening; update the code referencing tmp_path, ofs,
thread_write_buffer, and WRITE_BUFFER_SIZE accordingly.

44-45: Potential tmp file collision across processes using same storage path.

std::random_device{}() returns a 32-bit value. While thread-local ensures per-thread uniqueness, multiple processes could generate identical suffixes. Consider including PID:

Proposed fix
+#include <unistd.h>
+
 thread_local std::string tmp_file_suffix =
-    "_" + std::to_string(std::random_device{}()) + ".tmp";
+    "_" + std::to_string(getpid()) + "_" +
+    std::to_string(std::random_device{}()) + ".tmp";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp` around
lines 44 - 45, The thread-local tmp_file_suffix in file_io.cpp uses
std::random_device{}() which can collide across processes; modify construction
of tmp_file_suffix (the thread_local std::string) to include the current process
identifier plus a higher-entropy random value so suffixes are unique across
processes (e.g., concatenate "_" + process id + "_" + a 64-bit RNG value +
".tmp"). Use appropriate API for PID on each platform (getpid() on POSIX or
GetCurrentProcessId() on Windows) or fall back to
std::hash<std::thread::id>()/timestamp as needed, and seed a std::mt19937_64
with std::random_device for the random portion. Ensure the change preserves
thread_local semantics and the variable name tmp_file_suffix remains the same.
kv_connectors/llmd_fs_backend/docs/cufile_rdma.json (1)

69-69: Placeholder values must be replaced before use.

rdma_dev_addr_list contains <RDMA_NIC_IP_1> placeholders. This will cause cuFile parse failures if not replaced. The instructions at lines 4-5 are clear, but consider adding a startup validation in the GDS initialization code to detect and warn about placeholder values.
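A minimal startup check along these lines could reject unreplaced placeholders before cuFile ever parses the file. The properties.rdma_dev_addr_list key path mirrors the template; treat it, and the function name, as assumptions:

```python
import json
import re

# Matches template tokens such as <RDMA_NIC_IP_1> that were never replaced.
_PLACEHOLDER = re.compile(r"<[A-Z0-9_]+>")


def unreplaced_rdma_addrs(config_text: str) -> list[str]:
    """Return rdma_dev_addr_list entries that still look like placeholders.

    A non-empty result should be treated as a fatal misconfiguration
    before cuFile is initialized.
    """
    cfg = json.loads(config_text)
    addrs = cfg.get("properties", {}).get("rdma_dev_addr_list", [])
    return [a for a in addrs if _PLACEHOLDER.search(a)]
```

Failing fast with the offending entries in the error message gives the operator an actionable hint instead of an opaque cuFile parse failure.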

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/docs/cufile_rdma.json` at line 69, The
rdma_dev_addr_list JSON contains placeholder entries like "<RDMA_NIC_IP_1>"
which will break cuFile parsing; add a startup validation in the GDS
initialization path (e.g., the function that performs GDS
initialization/registration such as initGDS/initializeGDS or the routine that
loads cuFile/GDS config) that inspects the rdma_dev_addr_list array and fails
early or logs a clear error if any entry matches the placeholder pattern (e.g.,
contains '<' or matches /<.*>/ or known tokens like "<RDMA_NIC_IP_"),
instructing the user to replace them; make this validation produce a fatal error
or warning with actionable text so misconfigured JSON is detected before cuFile
is invoked.
kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp (1)

54-57: Docstring formatting issue: missing newline before gds_mode description.

Lines 55-56 run read_ratio) directly into gds_mode: without a separating newline.

Proposed fix
            "  read_preferring_workers: Number of workers that check "
-           "  read queue first (calculated as int(io_threads * read_ratio) "
-           "  gds_mode: GDS operation mode (see GdsMode in storage_types.hpp). "
+           "  read queue first (calculated as int(io_threads * read_ratio)).\n"
+           "  gds_mode: GDS operation mode (see GdsMode in storage_types.hpp). "
            "Defaults to 'disabled'.\n")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp`
around lines 54 - 57, The docstring in storage_offload_bindings.cpp concatenates
"read_ratio)" and "gds_mode:" without a newline; update the string literal so
there is a newline (e.g., "\n") before the "gds_mode: GDS operation mode..."
description to separate the paragraphs referring to
read_preferring_workers/read_ratio and gds_mode; locate the multi-line docstring
containing "read_preferring_workers" and "gds_mode" and insert the missing
newline in that literal.
kv_connectors/llmd_fs_backend/docs/gds.md (1)

76-78: Fenced code blocks should specify a language.

Add text or log as the language identifier for example output blocks to satisfy markdown linting and improve rendering.

Proposed fix
-```
+```text
 GdsFileIO: GPUDirect Storage (GDS) enabled

If GDS initialization fails, the connector automatically falls back to CPU staging and logs:

-```
+```text
 StorageOffloadEngine: GDS initialization failed, falling back to CPU_BUFFER_STAGE for both READ and WRITE
 StorageOffloadEngine: READ=CPU WRITE=CPU

Also applies to: 82-85

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/docs/gds.md` around lines 76 - 78, The markdown
example output blocks for GdsFileIO and the StorageOffloadEngine logs are
missing a fenced-code language; update the two fenced blocks containing
"GdsFileIO: GPUDirect Storage (GDS) enabled" and the block with
"StorageOffloadEngine: GDS initialization failed..." / "StorageOffloadEngine:
READ=CPU WRITE=CPU" to use a language identifier (e.g., add ```text or ```log)
so the code fences include a language for proper linting and rendering; locate
the blocks by the unique text snippets "GdsFileIO: GPUDirect Storage (GDS)
enabled" and "StorageOffloadEngine: GDS initialization failed, falling back to
CPU_BUFFER_STAGE" when making the change.
kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp (1)

104-110: Silent failure when dlopen fails; no diagnostic output.

When libcufile.so cannot be loaded, the code returns silently with m_handle = nullptr. Adding dlerror() output would help diagnose missing library paths or dependency issues in production.

Log dlopen failure reason
   CuFileApi() {
     m_handle = dlopen("libcufile.so", RTLD_NOW);
     if (!m_handle) {
       // Try versioned name
       m_handle = dlopen("libcufile.so.0", RTLD_NOW);
     }
-    if (!m_handle) return;
+    if (!m_handle) {
+      // Optionally log: dlerror() contains the failure reason
+      return;
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp`
around lines 104 - 110, The CuFileApi constructor currently returns silently
when dlopen fails (m_handle == nullptr); update CuFileApi to call dlerror() and
log or print the returned error string when both dlopen("libcufile.so",
RTLD_NOW) and dlopen("libcufile.so.0", RTLD_NOW) fail so you get diagnostic
output about why the library couldn't be loaded; locate the CuFileApi
constructor and add a diagnostic message referencing dlerror() (and the
attempted names "libcufile.so" / "libcufile.so.0") before returning so failures
are visible at runtime.
kv_connectors/llmd_fs_backend/tests/test_gds_backend.py (2)

30-34: Exception handling swallows all errors silently.

Catching bare Exception and returning False loses diagnostic information. If lsmod or ldconfig fails for unexpected reasons (e.g., permissions), the failure reason is hidden.

Log the exception for debugging
     try:
         result = subprocess.run(["lsmod"], capture_output=True, text=True, check=False)
         nvidia_fs_loaded = "nvidia_fs" in result.stdout
-    except Exception:
+    except Exception as e:
+        # Log for debugging; proceed without nvidia_fs detection
+        logging.warning("lsmod check failed: %s", e)  # requires `import logging`
         nvidia_fs_loaded = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/tests/test_gds_backend.py` around lines 30 -
34, The test currently swallows all errors in the try/except around the
subprocess.run call that sets nvidia_fs_loaded, making failures silent; update
the except block in the same try that invokes subprocess.run(["lsmod"]) to log
the caught exception (include the exception details/traceback) before setting
nvidia_fs_loaded = False — use the project logger or Python's logging with
logger.exception() or logging.error(..., exc_info=True) so the failure reason is
recorded for debugging while preserving the fallback behavior.

60-73: Minor inefficiency: duplicate status collection.

check_gds_available() and get_gds_status_message() both call _collect_gds_status(). In test_gds_available() (lines 81-82), both are invoked, resulting in redundant subprocess calls. Consider returning status from check_gds_available() or caching the result.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/tests/test_gds_backend.py` around lines 60 -
73, The two functions call _collect_gds_status() twice causing redundant work;
change check_gds_available() to return the collected status dict (or store it in
a module-level cached variable) instead of just a bool, then update
get_gds_status_message() and test_gds_available() to reuse that returned status
(or read the cached value) so _collect_gds_status() is invoked only once;
reference the functions check_gds_available, get_gds_status_message, and
_collect_gds_status when making the changes and adjust test_gds_available() to
accept the new return shape or use the cache.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp`:
- Around line 40-58: The CUfileDrvProps_t definition here mismatches the
official NVIDIA cuFile ABI and can corrupt memory when
cuFileDriverGetProperties() populates it; replace the local struct with the
exact definition from NVIDIA's cuFile headers (or include the vendor header
directly) so fields like the nested nvfs struct, CUfileFeatureFlags_t fflags,
unsigned int-typed fields (e.g., max_pinned_memory_size, per_buffer_cache_size),
and any optional max_batch_io_timeout_msecs are present and correctly typed and
ordered; keep CUfileDescr_t as-is but remove or replace the _reserved[256]
placeholder and ensure CUfileDrvProps_t matches the vendor header exactly.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`:
- Around line 215-228: The loop that registers GPU buffer chunks (using
num_chunks, chunk_size, size, block_ptr and calling m_cufile.cuFileBufRegister)
can overread when size isn't a multiple of chunk_size; change the registration
to compute remaining = size - (i * chunk_size) and set register_len =
std::min(chunk_size, remaining), call cuFileBufRegister with register_len (not
chunk_size) and store register_len in m_registered_buffers[block_ptr] so the
last block is clamped to the actual remaining bytes.
- Around line 32-34: The thread-local temporary suffix gds_tmp_suffix is
vulnerable to collisions because it uses only std::random_device{}() (32-bit);
change its generation to include the calling thread's identity and stronger
uniqueness (for example combine std::this_thread::get_id() and a per-thread or
global atomic counter or a larger random value) so each thread produces a
distinct suffix; update the thread_local gds_tmp_suffix initialization (the
symbol gds_tmp_suffix in gds_file_io.cpp) to concatenate thread id and an added
unique token to avoid temp file collisions when writing the same file_path.
- Around line 259-266: The current use of O_DIRECT when opening the temp file
(open(..., O_DIRECT)) can cause EINVAL in cuFileRead/cuFileWrite if
tensor.data_ptr() + (gpu_block_idx * block_size) or block_size is not aligned to
filesystem block boundaries; update the Open+IO path in GdsFileIO so it first
verifies alignment (ensure block_size and computed actual_gpu_ptr are aligned to
512/4096 as required) and if not aligned or if open with O_DIRECT or the cuFile
calls return EINVAL, retry the operation by reopening the file without O_DIRECT
and performing standard buffered I/O as a fallback; add explicit checks around
the arithmetic that computes actual_gpu_ptr (referencing tensor.data_ptr() and
gpu_block_idx) and improve error logging in the cuFileWrite/cuFileRead error
handlers to distinguish alignment errors vs other errno values so the code can
decide to retry without O_DIRECT.
- Around line 340-346: GdsFileIO::read_blocks_from_file currently ignores the
provided cudaStream_t and performs cuFileRead without GPU synchronization;
update this function to synchronize GPU work on the target device buffers
before/after the cuFileRead (mirror the write path and FileIO behavior) by
calling cudaStreamWaitEvent(stream, gpu_kvs_ready_event) before issuing the read
or calling cudaStreamSynchronize(stream) after the read completes (or both if
needed) to prevent concurrent kernel access to partially-written data; reference
GdsFileIO::read_blocks_from_file, gpu_kvs_ready_event, cudaStreamWaitEvent,
cudaStreamSynchronize, async_store_gpu_blocks, and FileIO::read_blocks_from_file
when making the change.

In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp`:
- Around line 310-327: The read path in storage_offload.cpp does not update
access time after a successful GDS read, which breaks LRU eviction; after a
successful TIME_EXPR_THROUGHPUT call that sets success from
m_read_handler->read_blocks_from_file (the block read branch in the function
where src_file and block_ids are used), call the FileIO::update_atime for the
same src_file (or invoke the appropriate update_atime method on the file IO
implementation such as GdsFileIO) before returning; ensure the update is
performed only when success is true and handle/log any update failures similarly
to the write path's update_atime usage so atime is kept in sync for LRU
eviction.

In `@kv_connectors/llmd_fs_backend/docs/cufile_rdma.json`:
- Around line 1-12: The cufile_rdma.json template contains // comments which
make it invalid JSON for cuFile; update the file cufile_rdma.json so it is
strict JSON with all comments removed (leave only JSON key/value entries), or
instead convert this file into a .md (documentation) and add a separate valid
cufile_rdma.json without comments; ensure any references to CUFILE_ENV_PATH_JSON
in the content remain unchanged and the final JSON parses cleanly with no //
comment lines.

In `@kv_connectors/llmd_fs_backend/docs/gds.md`:
- Line 8: The link fragment in gds.md uses the wrong anchor
`#tuning-cufile-cufilejson`; update it to match the generated GitHub anchor for
the heading `## Tuning cuFile (`cufile_rdma.json`)` by changing the fragment to
`#tuning-cufile-cufile_rdmajson` (or alternatively rename the heading to remove
the underscore to match the existing link). Locate the incorrect fragment string
in the link and replace it with the corrected anchor
`#tuning-cufile-cufile_rdmajson` (or adjust the heading text `Tuning cuFile
(cufile_rdma.json)`) so the link resolves.

In `@kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py`:
- Line 253: The constructor for the worker class (the __init__ method in
llmd_fs_backend/worker.py) makes gds_mode a required parameter which breaks
callers; change the signature to give gds_mode a default of "disabled" (e.g.,
gds_mode: str = "disabled") so callers are backward compatible and consistent
with the C++ binding default, and update any internal references or docstrings
that assume a provided value if necessary.

---

Outside diff comments:
In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp`:
- Around line 131-142: The calc_staging_bytes function currently ignores the
gds_mode parameter and always computes a staging buffer size; update
StorageOffloadEngine::calc_staging_bytes to return 0 when gds_mode indicates
full-GDS/direct GDS transfer mode (per the header note "0 for full-GDS modes")
and only compute the byte-size loop for non-GDS modes; locate the function by
name and add a conditional branch that checks the appropriate GdsMode enum/flag
for full-GDS before performing the tensor stride*element_size accumulation so
staging memory is not allocated for direct-GDS transfers.

---

Nitpick comments:
In
`@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp`:
- Around line 104-110: The CuFileApi constructor currently returns silently when
dlopen fails (m_handle == nullptr); update CuFileApi to call dlerror() and log
or print the returned error string when both dlopen("libcufile.so", RTLD_NOW)
and dlopen("libcufile.so.0", RTLD_NOW) fail so you get diagnostic output about
why the library couldn't be loaded; locate the CuFileApi constructor and add a
diagnostic message referencing dlerror() (and the attempted names "libcufile.so"
/ "libcufile.so.0") before returning so failures are visible at runtime.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp`:
- Around line 144-149: FileIO::update_atime currently calls utimensat and
ignores its return value; change it to check the return value and, on error
(return < 0), log a DEBUG-level message including the path and strerror(errno)
(or errno) to aid troubleshooting; ensure you include context (function
name/FileIO::update_atime and the path) in the log and keep behavior otherwise
unchanged (no exception thrown) so callers remain unaffected.
- Around line 66-74: The call to ofs.rdbuf()->pubsetbuf(...) occurs after the
std::ofstream is opened which is implementation-defined and may be ignored;
change to set the buffer on the streambuf before opening the file (use a
default-constructed std::ofstream ofs, call
ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE), then call
ofs.open(tmp_path, std::ios::out | std::ios::binary) and check the stream), or
alternatively use a constructor/approach that ensures the buffer is installed
prior to opening; update the code referencing tmp_path, ofs,
thread_write_buffer, and WRITE_BUFFER_SIZE accordingly.
- Around line 44-45: The thread-local tmp_file_suffix in file_io.cpp uses
std::random_device{}() which can collide across processes; modify construction
of tmp_file_suffix (the thread_local std::string) to include the current process
identifier plus a higher-entropy random value so suffixes are unique across
processes (e.g., concatenate "_" + process id + "_" + a 64-bit RNG value +
".tmp"). Use appropriate API for PID on each platform (getpid() on POSIX or
GetCurrentProcessId() on Windows) or fall back to
std::hash<std::thread::id>()/timestamp as needed, and seed a std::mt19937_64
with std::random_device for the random portion. Ensure the change preserves
thread_local semantics and the variable name tmp_file_suffix remains the same.
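The update_atime suggestion above (first bullet) can be sketched as follows; std::cerr stands in for the project's DEBUG-level logger, and the function name is a local stand-in for FileIO::update_atime:

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <iostream>
#include <sys/stat.h>

// Check utimensat's return value and log on failure; still no exception,
// so callers remain unaffected.
bool update_atime_logged(const char* path) {
  struct timespec times[2];
  times[0].tv_sec = 0;
  times[0].tv_nsec = UTIME_NOW;   // bump atime to now
  times[1].tv_sec = 0;
  times[1].tv_nsec = UTIME_OMIT;  // leave mtime untouched
  if (utimensat(AT_FDCWD, path, times, 0) < 0) {
    std::cerr << "FileIO::update_atime: utimensat failed for " << path
              << ": " << std::strerror(errno) << " (errno=" << errno << ")\n";
    return false;
  }
  return true;
}
```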
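The pubsetbuf ordering fix (second bullet) looks like this in isolation; WRITE_BUFFER_SIZE and thread_write_buffer mirror the names in file_io.cpp but are local stand-ins here:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

constexpr std::size_t WRITE_BUFFER_SIZE = 1 << 20;

// Install the write buffer on the streambuf *before* open(): calling
// pubsetbuf() on an already-open stream is implementation-defined and
// may be silently ignored.
bool write_with_prebuffered_stream(const std::string& tmp_path,
                                   const char* data, std::size_t len) {
  thread_local std::vector<char> thread_write_buffer(WRITE_BUFFER_SIZE);
  std::ofstream ofs;  // default-constructed: not yet open
  ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE);
  ofs.open(tmp_path, std::ios::out | std::ios::binary);
  if (!ofs) {
    return false;
  }
  ofs.write(data, static_cast<std::streamsize>(len));
  return static_cast<bool>(ofs);
}
```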
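The cross-process-unique suffix (third bullet) might look like this on POSIX; the variable keeps the name tmp_file_suffix and its thread_local semantics, and GetCurrentProcessId() would replace getpid() on Windows:

```cpp
#include <cassert>
#include <random>
#include <string>
#include <unistd.h>  // getpid()

// "_<pid>_<64-bit random>.tmp": PID disambiguates processes, the
// mt19937_64 value (seeded from std::random_device) disambiguates threads.
static std::string make_tmp_file_suffix() {
  std::mt19937_64 rng{std::random_device{}()};
  return "_" + std::to_string(getpid()) + "_" + std::to_string(rng()) + ".tmp";
}

thread_local std::string tmp_file_suffix = make_tmp_file_suffix();
```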

In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp`:
- Around line 54-57: The docstring in storage_offload_bindings.cpp concatenates
"read_ratio)" and "gds_mode:" without a newline; update the string literal so
there is a newline (e.g., "\n") before the "gds_mode: GDS operation mode..."
description to separate the paragraphs referring to
read_preferring_workers/read_ratio and gds_mode; locate the multi-line docstring
containing "read_preferring_workers" and "gds_mode" and insert the missing
newline in that literal.

In `@kv_connectors/llmd_fs_backend/docs/cufile_rdma.json`:
- Line 69: The rdma_dev_addr_list JSON contains placeholder entries like
"<RDMA_NIC_IP_1>" which will break cuFile parsing; add a startup validation in
the GDS initialization path (e.g., the function that performs GDS
initialization/registration such as initGDS/initializeGDS or the routine that
loads cuFile/GDS config) that inspects the rdma_dev_addr_list array and fails
early or logs a clear error if any entry matches the placeholder pattern (e.g.,
contains '<' or matches /<.*>/ or known tokens like "<RDMA_NIC_IP_"),
instructing the user to replace them; make this validation produce a fatal error
or warning with actionable text so misconfigured JSON is detected before cuFile
is invoked.
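A sketch of that startup validation; the function name and the idea of receiving the parsed rdma_dev_addr_list entries are assumptions (JSON parsing itself is omitted):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Fail fast with actionable text if any entry still looks like a
// "<RDMA_NIC_IP_1>"-style placeholder (angle brackets survive only in
// unedited template values).
void validate_rdma_dev_addr_list(const std::vector<std::string>& addrs) {
  for (const auto& addr : addrs) {
    if (addr.find('<') != std::string::npos ||
        addr.find('>') != std::string::npos) {
      throw std::runtime_error(
          "cufile_rdma.json: rdma_dev_addr_list entry '" + addr +
          "' looks like a placeholder; replace it with a real RDMA NIC IP "
          "before enabling GDS");
    }
  }
}
```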

In `@kv_connectors/llmd_fs_backend/docs/gds.md`:
- Around line 76-78: The markdown example output blocks for GdsFileIO and the
StorageOffloadEngine logs are missing a fenced-code language; update the two
fenced blocks containing "GdsFileIO: GPUDirect Storage (GDS) enabled" and the
block with "StorageOffloadEngine: GDS initialization failed..." /
"StorageOffloadEngine: READ=CPU WRITE=CPU" to use a language identifier (e.g.,
add ```text or ```log) so the code fences include a language for proper linting
and rendering; locate the blocks by the unique text snippets "GdsFileIO:
GPUDirect Storage (GDS) enabled" and "StorageOffloadEngine: GDS initialization
failed, falling back to CPU_BUFFER_STAGE" when making the change.

In `@kv_connectors/llmd_fs_backend/tests/test_gds_backend.py`:
- Around line 30-34: The test currently swallows all errors in the try/except
around the subprocess.run call that sets nvidia_fs_loaded, making failures
silent; update the except block in the same try that invokes
subprocess.run(["lsmod"]) to log the caught exception (include the exception
details/traceback) before setting nvidia_fs_loaded = False — use the project
logger or Python's logging with logger.exception() or logging.error(...,
exc_info=True) so the failure reason is recorded for debugging while preserving
the fallback behavior.
- Around line 60-73: The two functions call _collect_gds_status() twice causing
redundant work; change check_gds_available() to return the collected status dict
(or store it in a module-level cached variable) instead of just a bool, then
update get_gds_status_message() and test_gds_available() to reuse that returned
status (or read the cached value) so _collect_gds_status() is invoked only once;
reference the functions check_gds_available, get_gds_status_message, and
_collect_gds_status when making the changes and adjust test_gds_available() to
accept the new return shape or use the cache.
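The logging fix from the first bullet can be sketched as below; the function name is a stand-in for the code that sets nvidia_fs_loaded in test_gds_backend.py:

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def detect_nvidia_fs() -> bool:
    """Return True if the nvidia_fs kernel module appears in lsmod output."""
    try:
        result = subprocess.run(
            ["lsmod"], capture_output=True, text=True, check=True
        )
        return "nvidia_fs" in result.stdout
    except Exception:
        # logger.exception() records the traceback of the active exception,
        # so the failure reason is preserved while keeping the fallback.
        logger.exception("lsmod failed; assuming nvidia_fs is not loaded")
        return False
```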
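And the caching suggestion from the second bullet, with the status collection stubbed out (the real _collect_gds_status() inspects the driver; the dict contents here are illustrative):

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def _collect_gds_status() -> dict:
    # Stub: the real implementation probes the GDS driver. lru_cache ensures
    # this runs only once regardless of how many callers need the status.
    return {"available": False, "reason": "nvidia_fs module not loaded"}


def check_gds_available() -> dict:
    # Return the full status dict so callers can reuse it.
    return _collect_gds_status()


def get_gds_status_message() -> str:
    status = check_gds_available()
    return "GDS available" if status["available"] else status["reason"]
```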

In `@pkg/kvevents/engineadapter/vllm_adapter.go`:
- Around line 293-316: The toInt function is missing a case for the native Go
int type which causes "unsupported numeric type: int" errors; update the type
switch in toInt (function toInt) to include a case for int that returns int(v),
nil (you can add a comment similar to the uint64 case about overflow not being a
concern) so the function handles int values alongside int8/int16/int32/int64 and
the unsigned variants.
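A sketch of the extended type switch; the real toInt in vllm_adapter.go handles the full signed/unsigned set, so only the representative cases are shown:

```go
package main

import "fmt"

// toInt converts common msgpack-decoded numeric types to int, including
// the native int case the review asks for.
func toInt(v any) (int, error) {
	switch v := v.(type) {
	case int:
		return v, nil // native Go int
	case int8:
		return int(v), nil
	case int16:
		return int(v), nil
	case int32:
		return int(v), nil
	case int64:
		return int(v), nil
	case uint64:
		// Overflow is not a concern for the values seen here.
		return int(v), nil
	default:
		return 0, fmt.Errorf("unsupported numeric type: %T", v)
	}
}
```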

In `@pkg/tokenization/uds_tokenizer.go`:
- Around line 76-78: The warmup() function currently hardcodes a 30*time.Second
context timeout; replace that literal with the mmTimeout constant to keep
timeout values consistent. Locate the context creation in warmup() (e.g.,
context.WithTimeout(..., 30*time.Second)) and change the timeout argument to
mmTimeout so the function uses the shared mmTimeout variable defined near the
top of the file.
- Around line 284-291: The RenderChat call is creating a new context from
context.Background(), ignoring upstream deadlines; change the API to accept a
parent context.Context and derive the timeout from it (use ctx, cancel :=
context.WithTimeout(parentCtx, timeout)), update the Tokenizer interface
signature (Tokenizer.RenderChat) and all callers to pass through the caller's
context, and ensure cancel() is deferred; if changing signatures is out of
scope, add a clear code comment on RenderChat and the Tokenizer interface
describing this limitation and the potential resource leak so it is documented
for future work.



📥 Commits

Reviewing files that changed from the base of the PR and between 4b96290 and 81290a7.

📒 Files selected for processing (33)
  • .gitignore
  • kv_connectors/llmd_fs_backend/Makefile
  • kv_connectors/llmd_fs_backend/README.md
  • kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp
  • kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp
  • kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/file_io.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/logger.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/storage_handler.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp
  • kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp
  • kv_connectors/llmd_fs_backend/csrc/storage/storage_types.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/thread_pool.cpp
  • kv_connectors/llmd_fs_backend/docs/cufile_rdma.json
  • kv_connectors/llmd_fs_backend/docs/gds.md
  • kv_connectors/llmd_fs_backend/llmd_fs_backend/spec.py
  • kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py
  • kv_connectors/llmd_fs_backend/setup.py
  • kv_connectors/llmd_fs_backend/tests/conftest.py
  • kv_connectors/llmd_fs_backend/tests/test_fs_backend.py
  • kv_connectors/llmd_fs_backend/tests/test_gds_backend.py
  • pkg/kvevents/engineadapter/common.go
  • pkg/kvevents/engineadapter/sglang_adapter.go
  • pkg/kvevents/engineadapter/sglang_adapter_test.go
  • pkg/kvevents/engineadapter/vllm_adapter.go
  • pkg/kvevents/engineadapter/vllm_adapter_bench_test.go
  • pkg/kvevents/engineadapter/vllm_adapter_test.go
  • pkg/tokenization/uds_tokenizer.go
  • tests/e2e/uds_tokenizer/uds_e2e_test.go
💤 Files with no reviewable changes (1)
  • kv_connectors/llmd_fs_backend/csrc/storage/file_io.hpp

Comment on lines +40 to +58
// File descriptor passed to cuFileHandleRegister (layout must match cufile.h)
struct CUfileDescr_t {
  CUfileFileHandleType type;
  union {
    int fd;
    void* handle;
  } handle;
  const CUfileFSOps_t* fs_ops;
};

// Opaque handle returned by cuFileHandleRegister
typedef void* CUfileHandle_t;

// Driver properties returned by cuFileDriverGetProperties
struct CUfileDrvProps_t {
  size_t max_device_cache_size;
  size_t max_device_pinned_mem_size;
  char _reserved[256];
};

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

What is the struct layout for CUfileDescr_t and CUfileDrvProps_t in NVIDIA cuFile library version 1.x?

💡 Result:

The struct layouts for CUfileDescr_t and CUfileDrvProps_t in NVIDIA cuFile library version 1.x, as documented in official NVIDIA GPUDirect Storage cuFile API references (across CUDA 11.4-11.8 and current GDS v1.16 docs), are as follows:

CUfileDescr_t:

    typedef struct CUfileDescr_t {
        CUfileFileHandleType type;      /* type of file being registered */
        union {
            int fd;                     /* Linux */
            void* handle;               /* Windows */
        } handle;
        const CUfileFSOps_t* fs_ops;    /* file system operation table */
    } CUfileDescr_t;

CUfileDrvProps_t:

    typedef struct CUfileDrvProps {
        struct {
            unsigned int major_version;
            unsigned int minor_version;
            size_t poll_thresh_size;
            size_t max_direct_io_size;
            unsigned int dstatusflags;
            unsigned int dcontrolflags;
        } nvfs;
        CUfileFeatureFlags_t fflags;
        unsigned int max_device_cache_size;
        unsigned int per_buffer_cache_size;
        unsigned int max_pinned_memory_size;
        unsigned int max_batch_io_timeout_msecs; /* present in some versions, e.g. CUDA 11.6 */
    } CUfileDrvProps_t;

Note: Minor variations exist across exact versions (e.g., max_batch_io_timeout_msecs appears in CUDA 11.6+ docs), but the core layout is consistent. The structure is self-describing per cuFileDriverGetProperties documentation. Layouts match across PDF docs, HTML API guides, and manpages for libcufile-dev 1.5+.


Critical ABI mismatch in CUfileDrvProps_t struct definition—will cause memory corruption.

The CUfileDrvProps_t struct defined in this file does not match NVIDIA's official cuFile API definition. According to NVIDIA's cuFile documentation (CUDA 11.4–11.8, GDS v1.16), the actual struct contains:

  • A nested nvfs struct with version and configuration fields
  • CUfileFeatureFlags_t fflags
  • unsigned int field types (not size_t)
  • Fields named max_pinned_memory_size and per_buffer_cache_size (not max_device_pinned_mem_size)
  • Optional max_batch_io_timeout_msecs in newer versions

The code's simplified definition with _reserved[256] padding will misalign with the actual library structure, causing memory corruption when cuFileDriverGetProperties() writes to it. Fix struct definition to match NVIDIA's headers exactly, or use the official NVIDIA-provided header directly instead of recreating it. CUfileDescr_t is correctly defined.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp`
around lines 40 - 58, The CUfileDrvProps_t definition here mismatches the
official NVIDIA cuFile ABI and can corrupt memory when
cuFileDriverGetProperties() populates it; replace the local struct with the
exact definition from NVIDIA's cuFile headers (or include the vendor header
directly) so fields like the nested nvfs struct, CUfileFeatureFlags_t fflags,
unsigned int-typed fields (e.g., max_pinned_memory_size, per_buffer_cache_size),
and any optional max_batch_io_timeout_msecs are present and correctly typed and
ordered; keep CUfileDescr_t as-is but remove or replace the _reserved[256]
placeholder and ensure CUfileDrvProps_t matches the vendor header exactly.

Comment on lines +32 to +34
// Thread-local unique suffix for temporary files (matches file_io.cpp pattern)
thread_local std::string gds_tmp_suffix =
"_" + std::to_string(std::random_device{}()) + ".tmp";

⚠️ Potential issue | 🟡 Minor

Thread-local temp suffix may collide across threads.

std::random_device{}() returns a 32-bit value. With many threads, birthday paradox makes collisions more likely. If two threads get the same suffix and write to the same file_path, they'll use the same temp file, causing data corruption or race conditions.

Include thread ID for uniqueness
-thread_local std::string gds_tmp_suffix =
-    "_" + std::to_string(std::random_device{}()) + ".tmp";
+thread_local std::string gds_tmp_suffix =
+    "_" + std::to_string(std::random_device{}()) + "_" +
+    std::to_string(std::hash<std::thread::id>{}(std::this_thread::get_id())) +
+    ".tmp";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`
around lines 32 - 34, The thread-local temporary suffix gds_tmp_suffix is
vulnerable to collisions because it uses only std::random_device{}() (32-bit);
change its generation to include the calling thread's identity and stronger
uniqueness (for example combine std::this_thread::get_id() and a per-thread or
global atomic counter or a larger random value) so each thread produces a
distinct suffix; update the thread_local gds_tmp_suffix initialization (the
symbol gds_tmp_suffix in gds_file_io.cpp) to concatenate thread id and an added
unique token to avoid temp file collisions when writing the same file_path.

Comment on lines +215 to +228
for (size_t i = 0; i < num_chunks; i++) {
void* block_ptr = static_cast<uint8_t*>(gpu_ptr) + (i * chunk_size);

CUfileError_t status = m_cufile.cuFileBufRegister(block_ptr,
chunk_size,
CU_FILE_RDMA_REGISTER);
if (status.err != CU_FILE_SUCCESS) {
FS_LOG_WARN("GdsFileIO: cuFileBufRegister failed for block "
<< i << " with error code: " << status.err);
return false;
}

m_registered_buffers[block_ptr] = chunk_size;
}

⚠️ Potential issue | 🔴 Critical

Buffer overread in chunked registration when size is not a multiple of chunk_size.

When size is not evenly divisible by chunk_size, the final iteration registers chunk_size bytes starting at block_ptr, but only size - (i * chunk_size) bytes remain. This causes cuFileBufRegister to access memory beyond the buffer boundary (undefined behavior).

Fix: clamp the last chunk size
   for (size_t i = 0; i < num_chunks; i++) {
     void* block_ptr = static_cast<uint8_t*>(gpu_ptr) + (i * chunk_size);
+    size_t this_chunk_size = std::min(chunk_size, size - (i * chunk_size));

     CUfileError_t status = m_cufile.cuFileBufRegister(block_ptr,
-                                                      chunk_size,
+                                                      this_chunk_size,
                                                       CU_FILE_RDMA_REGISTER);
     if (status.err != CU_FILE_SUCCESS) {
       FS_LOG_WARN("GdsFileIO: cuFileBufRegister failed for block "
                   << i << " with error code: " << status.err);
       return false;
     }

-    m_registered_buffers[block_ptr] = chunk_size;
+    m_registered_buffers[block_ptr] = this_chunk_size;
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

    for (size_t i = 0; i < num_chunks; i++) {
      void* block_ptr = static_cast<uint8_t*>(gpu_ptr) + (i * chunk_size);
      CUfileError_t status = m_cufile.cuFileBufRegister(block_ptr,
                                                        chunk_size,
                                                        CU_FILE_RDMA_REGISTER);
      if (status.err != CU_FILE_SUCCESS) {
        FS_LOG_WARN("GdsFileIO: cuFileBufRegister failed for block "
                    << i << " with error code: " << status.err);
        return false;
      }
      m_registered_buffers[block_ptr] = chunk_size;
    }

After:

    for (size_t i = 0; i < num_chunks; i++) {
      void* block_ptr = static_cast<uint8_t*>(gpu_ptr) + (i * chunk_size);
      size_t this_chunk_size = std::min(chunk_size, size - (i * chunk_size));
      CUfileError_t status = m_cufile.cuFileBufRegister(block_ptr,
                                                        this_chunk_size,
                                                        CU_FILE_RDMA_REGISTER);
      if (status.err != CU_FILE_SUCCESS) {
        FS_LOG_WARN("GdsFileIO: cuFileBufRegister failed for block "
                    << i << " with error code: " << status.err);
        return false;
      }
      m_registered_buffers[block_ptr] = this_chunk_size;
    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`
around lines 215 - 228, The loop that registers GPU buffer chunks (using
num_chunks, chunk_size, size, block_ptr and calling m_cufile.cuFileBufRegister)
can overread when size isn't a multiple of chunk_size; change the registration
to compute remaining = size - (i * chunk_size) and set register_len =
std::min(chunk_size, remaining), call cuFileBufRegister with register_len (not
chunk_size) and store register_len in m_registered_buffers[block_ptr] so the
last block is clamped to the actual remaining bytes.

Comment on lines +259 to +266
// O_RDWR required by cuFile for internal DMA setup even on write-only paths
int fd = open(tmp_path.c_str(), O_RDWR | O_CREAT | O_DIRECT, 0644);
if (fd < 0) {
FS_LOG_ERROR("GdsFileIO: Failed to open temporary file "
<< tmp_path << ": " << std::strerror(errno)
<< " (errno=" << errno << ")");
return false;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if cuFile documentation specifies alignment requirements for O_DIRECT
rg -n -i "align|O_DIRECT" --type=cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 357


🏁 Script executed:

#!/bin/bash
# Read the function context around lines 259-266
sed -n '240,280p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1624


🏁 Script executed:

#!/bin/bash
# Search for cuFileWrite and cuFileRead calls to see how they use file descriptors
rg -n -A 5 "cuFileWrite|cuFileRead" --type=cpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 7682


🏁 Script executed:

#!/bin/bash
# Check for any alignment-related code or buffers
rg -n -i "alignment|aligned|align" --type=cpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

#!/bin/bash
# Find where actual_gpu_ptr is defined/set
rg -n "actual_gpu_ptr" --type=cpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 576


🏁 Script executed:

#!/bin/bash
# Check the tensor_copier to see if it guarantees any alignment
rg -n -i "tensor_copier|get_tensors" --type=cpp -A 3 kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/ | head -60

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 3556


🏁 Script executed:

#!/bin/bash
# Search for any cuFile documentation or alignment mentions in header files
rg -n -i "align|buffer.*requirement|DMA" --type=hpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 100


🏁 Script executed:

#!/bin/bash
# View the actual_gpu_ptr assignment with context
sed -n '285,310p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1203


🏁 Script executed:

#!/bin/bash
# View the read-side assignment
sed -n '377,405p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1274


🏁 Script executed:

#!/bin/bash
# Find and examine tensor_copier.hpp
find . -name "tensor_copier.hpp" -o -name "tensor_copier.h" | head -5

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 135


🏁 Script executed:

#!/bin/bash
# Search for cuFileBufRegister to see if GPU buffers are registered with alignment info
rg -n "cuFileBufRegister" --type=cpp -B 2 -A 5

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 6897


🏁 Script executed:

#!/bin/bash
# View tensor_copier.hpp
cat kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.hpp | head -100

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 2196


🏁 Script executed:

#!/bin/bash
# Search for error code handling or EINVAL
rg -n "EINVAL|error.*code|errno.*22" --type=cpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1146


🏁 Script executed:

#!/bin/bash
# Look for block_size documentation or constraints
rg -n -B 2 -A 2 "block_size" kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.hpp | head -40

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 433


Verify block_size alignment or implement fallback to non-O_DIRECT I/O.

O_DIRECT requires buffers aligned to filesystem block boundaries (512 bytes or 4KB). Line 293 calculates actual_gpu_ptr = tensor.data_ptr() + (gpu_block_idx * block_size). If block_size is not divisible by 512 or 4K, this arithmetic produces misaligned pointers, causing cuFileWrite/cuFileRead to fail with EINVAL on certain filesystems. Additionally, error handling (lines 304-307, 396-399) logs generic error codes without distinguishing alignment failures, preventing graceful degradation. Either assert that block_size meets alignment constraints, or implement fallback to standard I/O when O_DIRECT fails.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`
around lines 259 - 266, The current use of O_DIRECT when opening the temp file
(open(..., O_DIRECT)) can cause EINVAL in cuFileRead/cuFileWrite if
tensor.data_ptr() + (gpu_block_idx * block_size) or block_size is not aligned to
filesystem block boundaries; update the Open+IO path in GdsFileIO so it first
verifies alignment (ensure block_size and computed actual_gpu_ptr are aligned to
512/4096 as required) and if not aligned or if open with O_DIRECT or the cuFile
calls return EINVAL, retry the operation by reopening the file without O_DIRECT
and performing standard buffered I/O as a fallback; add explicit checks around
the arithmetic that computes actual_gpu_ptr (referencing tensor.data_ptr() and
gpu_block_idx) and improve error logging in the cuFileWrite/cuFileRead error
handlers to distinguish alignment errors vs other errno values so the code can
decide to retry without O_DIRECT.
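The alignment pre-check suggested above can be sketched without any CUDA or cuFile dependency; kDirectIoAlignment and the function name are stand-ins, and 4096 is the conservative bound (any 4 KiB-aligned value is also 512-byte aligned):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kDirectIoAlignment = 4096;

// Verify both the buffer pointer and block_size meet O_DIRECT alignment
// constraints before opening with O_DIRECT; on failure the caller would
// reopen without O_DIRECT and fall back to buffered I/O.
inline bool is_direct_io_aligned(const void* ptr, std::size_t block_size) {
  auto addr = reinterpret_cast<std::uintptr_t>(ptr);
  return addr % kDirectIoAlignment == 0 &&
         block_size % kDirectIoAlignment == 0;
}
```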

Comment on lines +340 to +346
// StorageHandler interface: Read blocks from file
bool GdsFileIO::read_blocks_from_file(const std::string& file_path,
const std::vector<int64_t>& block_ids,
cudaStream_t stream) {
// Each ThreadPool thread has its own CUDA stream, but cuFileRead is
// synchronous and operates directly on the device — no stream needed.
(void)stream;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how async_load_gpu_blocks callers synchronize
ast-grep --pattern 'async_load_gpu_blocks($$$)'
rg -n -B10 -A5 'async_load_gpu_blocks'

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 6845


🏁 Script executed:

# Get the full implementation of async_load_gpu_blocks to understand synchronization
sed -n '286,350p' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 2442


🏁 Script executed:

# Also get the write path implementation for comparison
sed -n '210,280p' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 3062


🏁 Script executed:

# Check for synchronization calls related to gpu_kvs_ready_event
rg -n 'gpu_kvs_ready_event' kv_connectors/llmd_fs_backend/csrc/storage/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 468


🏁 Script executed:

# Check how read_blocks_from_file is called within async_load_gpu_blocks
rg -n 'read_blocks_from_file' kv_connectors/llmd_fs_backend/csrc/storage/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1184


🏁 Script executed:

# Check the wait_job and get_finished implementations
sed -n '130,180p' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1659


🏁 Script executed:

# Check how Python caller waits after async_load_gpu_blocks
sed -n '230,260p' kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1128


🏁 Script executed:

# Check get_finished implementation
rg -n 'get_finished' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp -A10

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 439


🏁 Script executed:

# Check if ThreadPool's TLS stream is related to CUDA default stream
rg -n 'get_tls_stream\|ThreadPool' kv_connectors/llmd_fs_backend/csrc/storage/ -A5 | head -80

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

# Get more context on wait_job implementation
sed -n '170,200p' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 802


🏁 Script executed:

# Check GdsFileIO read_blocks_from_file full implementation
sed -n '340,400p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 2520


🏁 Script executed:

# Check if there's stream synchronization after async jobs
rg -n 'cudaStreamSynchronize\|cudaStreamWaitEvent' kv_connectors/llmd_fs_backend/csrc/storage/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

# Check how the loaded blocks are accessed in Python after recording job
rg -n '_record_job\|wait_job\|get_finished' kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py -B3 -A10

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

# Check the end of read_blocks_from_file to see full implementation
sed -n '390,430p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1463


🏁 Script executed:

# Check FileIO::read_blocks_from_file for comparison (non-GDS variant)
sed -n '190,230p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1348


🏁 Script executed:

# Check what happens after _record_job is called - trace the control flow
grep -n '_record_job\|wait_job\|get_finished' kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 404


🏁 Script executed:

# Check if there's any explicit synchronization at the scheduler level
rg -n 'wait_job\|get_finished' kv_connectors/llmd_fs_backend/llmd_fs_backend/ -B5 -A5

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

# Check scheduler/high-level usage to see when buffers become accessible
rg -n 'get_finished\|wait_job' kv_connectors/llmd_fs_backend/llmd_fs_backend/ -B10 -A10

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

# Check if there's any CUDA synchronization between wait_job and actual GPU access
rg -n 'torch.cuda\|cuda_stream\|synchronize' kv_connectors/llmd_fs_backend/llmd_fs_backend/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

# Compare the two read implementations side by side - check for stream sync
rg -n 'cudaStreamSynchronize' kv_connectors/llmd_fs_backend/csrc/storage/backends/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 615


GdsFileIO read path ignores CUDA stream and lacks GPU synchronization, creating potential data race with concurrent GPU access.

GdsFileIO::read_blocks_from_file (line 340-346) discards the stream parameter with (void)stream; and performs cuFileRead directly without stream synchronization. This contrasts with the write path (async_store_gpu_blocks), which waits on gpu_kvs_ready_event before write operations start, and with FileIO::read_blocks_from_file, which calls cudaStreamSynchronize(stream) after read completes (line 217). Since cuFileRead is synchronous but not stream-aware, concurrent GPU kernels reading from the target buffers during or shortly after the read operation will access stale or partially-written data, risking corruption.

Add pre-read synchronization via cudaStreamWaitEvent (like the write path) or post-read cudaStreamSynchronize (like FileIO) to ensure GPU work on the buffers completes before the read begins.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`
around lines 340 - 346, GdsFileIO::read_blocks_from_file currently ignores the
provided cudaStream_t and performs cuFileRead without GPU synchronization;
update this function to synchronize GPU work on the target device buffers
before/after the cuFileRead (mirror the write path and FileIO behavior) by
calling cudaStreamWaitEvent(stream, gpu_kvs_ready_event) before issuing the read
or calling cudaStreamSynchronize(stream) after the read completes (or both if
needed) to prevent concurrent kernel access to partially-written data; reference
GdsFileIO::read_blocks_from_file, gpu_kvs_ready_event, cudaStreamWaitEvent,
cudaStreamSynchronize, async_store_gpu_blocks, and FileIO::read_blocks_from_file
when making the change.

Comment on lines +310 to +327
// Execute the read operation using polymorphic storage handler
try {
  // Stage 1: Read file to staging CPU tensor.
  // Read data from disk into a tensor.
  success = TIME_EXPR("read phase 1: read_buffer_from_file",
                      read_buffer_from_file(src_file, buf),
                      "file:",
                      src_file);
  size_t total_size =
      block_ids.size() * m_tensor_copier.get_block_size();
  success = TIME_EXPR_THROUGHPUT(
      "read: storage handler",
      m_read_handler->read_blocks_from_file(src_file,
                                            block_ids,
                                            tls_stream.stream()),
      total_size,
      "file:",
      src_file,
      " blocks:",
      block_ids.size());
  if (!success) {
    FS_LOG_ERROR("Stage1 read_buffer_from_file failed for "
                 << src_file);
    return success;
  }
  // Stage 2: copy tensors from staging CPU tensor to GPU.
  // Perform asynchronous GPU copy and tensor swap.
  auto* cpu_base = static_cast<uint8_t*>(buf.ptr);
  bool is_store = false;
  // Execute the copy operation
  success = TIME_EXPR(
      "read phase 2: copy_cpu_tensor_to_gpu_tensors",
      m_tensor_copier.copy_blocks(cpu_base, block_ids, is_store),
      "file: ",
      src_file);

  auto& tls_stream = ThreadPool::get_tls_stream();
  cudaError_t err = cudaStreamSynchronize(tls_stream.stream());
  if (err != cudaSuccess) {
    FS_LOG_ERROR(
        "cudaStreamSynchronize failed: " << cudaGetErrorString(err));
    return false;
    FS_LOG_ERROR("Load failed for " << src_file);
  }
  return success;

⚠️ Potential issue | 🟠 Major

Missing atime update after GDS read operations breaks LRU eviction.

The write path calls FileIO::update_atime() when a file already exists (line 231), but the read path never updates atime. When m_read_handler is a GdsFileIO instance, files read via GDS won't have their access times updated, breaking LRU cache eviction logic that depends on atime. As noted in relevant code snippet from storage_offload.cpp:313-321, no update_atime() call follows the read operation.

Add atime update after successful read
```diff
             success = TIME_EXPR_THROUGHPUT(
                 "read: storage handler",
                 m_read_handler->read_blocks_from_file(src_file,
                                                       block_ids,
                                                       tls_stream.stream()),
                 total_size,
                 "file:",
                 src_file,
                 " blocks:",
                 block_ids.size());
             if (!success) {
               FS_LOG_ERROR("Load failed for " << src_file);
+            } else {
+              FileIO::update_atime(src_file);
             }
             return success;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp` around lines
310 - 327, The read path in storage_offload.cpp does not update access time
after a successful GDS read, which breaks LRU eviction; after a successful
TIME_EXPR_THROUGHPUT call that sets success from
m_read_handler->read_blocks_from_file (the block read branch in the function
where src_file and block_ids are used), call the FileIO::update_atime for the
same src_file (or invoke the appropriate update_atime method on the file IO
implementation such as GdsFileIO) before returning; ensure the update is
performed only when success is true and handle/log any update failures similarly
to the write path's update_atime usage so atime is kept in sync for LRU
eviction.
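As a language-neutral illustration of the suggested fix, here is a hedged Python sketch of what an `update_atime` helper does: it rewrites only the access time so atime-based LRU ranking sees the file as recently used. The helper name mirrors `FileIO::update_atime`, but this is an illustrative stand-in, not the project's C++ code.

```python
import os
import tempfile
import time

def update_atime(path: str) -> None:
    """Refresh only the access time; leave mtime untouched (illustrative stand-in)."""
    st = os.stat(path)
    os.utime(path, (time.time(), st.st_mtime))

# Demo: create a file, age its atime by an hour, then refresh it after a "read".
fd, path = tempfile.mkstemp()
os.write(fd, b"kv-block")
os.close(fd)
os.utime(path, (time.time() - 3600, os.stat(path).st_mtime))
stale_atime = os.stat(path).st_atime
update_atime(path)
fresh_atime = os.stat(path).st_atime
os.unlink(path)
```

An explicit `utime` call always sets the timestamp, so this pattern also works on filesystems mounted with `relatime`, where a plain read may not bump atime at all.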

Comment on lines +1 to +12
```
{
    // Optimized cuFile configuration for RDMA-based storage (NVMe-oF, BeeGFS, Lustre, GPFS, Weka).
    //
    // REQUIRED: Set rdma_dev_addr_list to the IP addresses of your RDMA NICs.
    // To find them: `ibdev2netdev` or `ip addr show` (look for your RDMA interface IPs).
    //
    // Usage:
    //   export CUFILE_ENV_PATH_JSON=/path/to/cufile_rdma.json
    //
    // NOTE: Application can override custom configuration via export CUFILE_ENV_PATH_JSON=<filepath>
    // e.g : export CUFILE_ENV_PATH_JSON="/home/<xxx>/cufile_rdma.json"
```
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does NVIDIA cuFile configuration support JSON with comments?

💡 Result:

No, NVIDIA cuFile configuration does not support JSON with comments. Official NVIDIA documentation for cuFile shows the /etc/cufile.json configuration file using standard JSON syntax with inline comments in example code (e.g., result 1: "// specify list of file-systems"), but these appear as documentation artifacts rather than parsed comments. Standard JSON parsers reject comments, and no cuFile-specific documentation or source evidence indicates use of a JSONC (JSON with comments) parser. RAPIDS cuDF JSON reader explicitly lacks comment support (result 16), reinforcing standard JSON expectations in NVIDIA ecosystem tools. cuFile is a proprietary closed-source library; no public source confirms custom parsing.

Remove // comments—cuFile requires standard JSON and will fail to parse this configuration.

NVIDIA cuFile does not support JSON with comments. This template will generate parse errors when users attempt to use it. Either:

  1. Remove all // comments and keep as .json, or
  2. Provide cufile_rdma.json (valid JSON, no comments) alongside this file as .md documentation.

Users cannot directly use this as CUFILE_ENV_PATH_JSON without editing.

🧰 Tools
🪛 Biome (2.4.9)

[error] 2-2: Expected a property but instead found '// Optimized cuFile configuration for RDMA-based storage (NVMe-oF, BeeGFS, Lustre, GPFS, Weka).'.

(parse)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/docs/cufile_rdma.json` around lines 1 - 12, The
cufile_rdma.json template contains // comments which make it invalid JSON for
cuFile; update the file cufile_rdma.json so it is strict JSON with all comments
removed (leave only JSON key/value entries), or instead convert this file into a
.md (documentation) and add a separate valid cufile_rdma.json without comments;
ensure any references to CUFILE_ENV_PATH_JSON in the content remain unchanged
and the final JSON parses cleanly with no // comment lines.
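If one wanted to keep the commented template readable, a pre-processing step could strip the `//` comments before handing the file to cuFile. The sketch below is naive and illustrative: it assumes `//` never appears inside a JSON string value (which holds for this template), and the keys shown are placeholders, not a statement of the real cuFile schema.

```python
import json
import re

def strip_json_comments(text: str) -> str:
    # Naive: delete everything from "//" to end of line.
    # Assumes "//" never occurs inside a string value.
    return re.sub(r"//[^\n]*", "", text)

template = """{
    // comments like this make the file invalid strict JSON
    "execution": {
        "max_io_threads": 4  // illustrative key/value, not the full schema
    }
}"""

# After stripping, the template parses with any standard JSON parser.
config = json.loads(strip_json_comments(template))
```

The cleaner option from the review stands: ship a strict-JSON `cufile_rdma.json` and keep the commentary in a separate `.md` file.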


GDS uses the **cuFile** library by default for all GPU↔storage transfers. cuFile behavior (I/O threads,
bounce buffer sizes, RDMA settings, filesystem-specific tuning) can be customized via a `cufile_rdma.json`
config file — see [Tuning cuFile](#tuning-cufile-cufilejson) for an example.
⚠️ Potential issue | 🟡 Minor

Broken link fragment: heading anchor mismatch.

The link `#tuning-cufile-cufilejson` won't resolve to the heading at line 93 (`## Tuning cuFile (cufile_rdma.json)`). GitHub generates the anchor as `#tuning-cufile-cufile_rdmajson` (with the underscore from the filename).

Proposed fix
```diff
-bounce buffer sizes, RDMA settings, filesystem-specific tuning) can be customized via a `cufile_rdma.json`
-config file — see [Tuning cuFile](#tuning-cufile-cufilejson) for an example.
+bounce buffer sizes, RDMA settings, filesystem-specific tuning) can be customized via a `cufile_rdma.json`
+config file — see [Tuning cuFile](#tuning-cufile-cufile_rdmajson) for an example.
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-config file — see [Tuning cuFile](#tuning-cufile-cufilejson) for an example.
+config file — see [Tuning cuFile](#tuning-cufile-cufile_rdmajson) for an example.
```
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 8-8: Link fragments should be valid

(MD051, link-fragments)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/docs/gds.md` at line 8, The link fragment in
gds.md uses the wrong anchor `#tuning-cufile-cufilejson`; update it to match the
generated GitHub anchor for the heading `## Tuning cuFile (`cufile_rdma.json`)`
by changing the fragment to `#tuning-cufile-cufile_rdmajson` (or alternatively
rename the heading to remove the underscore to match the existing link). Locate
the incorrect fragment string in the link and replace it with the corrected
anchor `#tuning-cufile-cufile_rdmajson` (or adjust the heading text `Tuning
cuFile (cufile_rdma.json)`) so the link resolves.
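The anchor mismatch can be checked mechanically. Below is an approximate sketch of GitHub's heading-anchor rule (lowercase, keep word characters, spaces, and hyphens, then turn spaces into hyphens); it is a simplified model, not GitHub's exact implementation, and ignores the numeric suffixes GitHub appends for duplicate headings.

```python
import re

def github_anchor(heading: str) -> str:
    # Approximate GitHub slug rule: lowercase, strip punctuation other than
    # word chars/spaces/hyphens, then replace spaces with hyphens.
    text = heading.lower()
    text = re.sub(r"[^\w\- ]", "", text)
    return text.replace(" ", "-")

# The heading under discussion: the underscore survives, the dot and parens do not.
anchor = github_anchor("Tuning cuFile (cufile_rdma.json)")
# → "tuning-cufile-cufile_rdmajson"
```

This reproduces the review's point: the filename's underscore lands in the slug, so `#tuning-cufile-cufilejson` can never match.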

```python
        gpu_block_size: int,
        gpu_blocks_per_file: int,
        threads_per_gpu: int,
        gds_mode: str,
```
⚠️ Potential issue | 🟡 Minor

Missing default value for gds_mode parameter.

The constructor lacks a default value for gds_mode, making it a required argument. This is a breaking change for any existing callers. Consider adding gds_mode: str = "disabled" for backward compatibility, consistent with the C++ binding default.

Proposed fix
```diff
     def __init__(
         self,
         kv_caches: dict[str, torch.Tensor],
         attn_backends: dict[str, type[AttentionBackend]],
         file_mapper: FileMapper,
         gpu_block_size: int,
         gpu_blocks_per_file: int,
         threads_per_gpu: int,
-        gds_mode: str,
+        gds_mode: str = "disabled",
         max_staging_memory_gb: int = DEFAULT_MAX_STAGING_MEMORY_GB,
         read_preferring_ratio: float = DEFAULT_READ_PREFERRING_WORKERS_RATIO,
     ):
```
📝 Committable suggestion

Suggested change
```diff
-        gds_mode: str,
+        gds_mode: str = "disabled",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py` at line 253, The
constructor for the worker class (the __init__ method in
llmd_fs_backend/worker.py) makes gds_mode a required parameter which breaks
callers; change the signature to give gds_mode a default of "disabled" (e.g.,
gds_mode: str = "disabled") so callers are backward compatible and consistent
with the C++ binding default, and update any internal references or docstrings
that assume a provided value if necessary.
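A minimal sketch of the compatibility point, using a hypothetical `Worker` class rather than the project's actual signature: giving the new parameter a default means call sites written before the parameter existed keep constructing the object unchanged.

```python
class Worker:
    """Hypothetical stand-in for the connector worker's __init__."""

    def __init__(self, threads_per_gpu: int, gds_mode: str = "disabled"):
        self.threads_per_gpu = threads_per_gpu
        self.gds_mode = gds_mode

# A pre-existing call site that never heard of gds_mode still works,
# and GDS stays off by default — consistent with the C++ binding default.
old_caller = Worker(threads_per_gpu=4)

# New callers opt in explicitly.
new_caller = Worker(threads_per_gpu=4, gds_mode="read_write")
```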
