[sync] upstream llm-d main branch 1c435cb [2026-04-02]#16

Open
zdtsw wants to merge 6 commits into opendatahub-io:main_2 from zdtsw-forking:sync/upstream-1c435cb

Conversation


@zdtsw zdtsw commented Apr 2, 2026

Syncs llm-d/llm-d-kv-cache main branch into ODH main_2 branch.

Upstream commit: llm-d@1c435cb

This maps to https://github.com/llm-d/llm-d-kv-cache/releases/tag/v0.7.0

Summary by CodeRabbit

Release Notes

  • New Features

    • Added GPUDirect Storage (GDS) support for GPU-accelerated file I/O with configurable read/write and bounce-buffer modes.
  • Improvements

    • Enhanced vLLM event decoding for improved compatibility.
    • Improved tokenizer timeout handling for multimodal content requests.
  • Documentation

    • Added comprehensive GDS setup guides and configuration examples.
  • Tests

    • Added GDS backend tests and vLLM event decoding benchmarks.

vMaroon and others added 6 commits March 30, 2026 10:01
Multimodal requests need to download and process images, which can
exceed the 5s default timeout in CI. Use 30s for requests with
structured content parts.
* feat: add GPUDirect Storage (GDS) support to llmd_fs_backend

Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
Granite's built-in chat template injects the current date via
strftime_now(), causing goldenChatTokenIDs to change daily.
Add an explicit system message to the golden chat conversation
so the template uses the provided content instead of generating
a date-dependent default system prompt.
* fix: single-pass []any decode for forward/backward compat with vLLM event schema (llm-d#484)

vLLM uses msgspec with array_like=True and omit_defaults=True, producing
positional msgpack arrays where trailing fields may be absent. The previous
typed-struct decode broke when vLLM appended new fields (old consumer fails)
or when a newer consumer read from an older vLLM (shorter array than expected).

Replace double-decode ([]any for tag + typed struct) with a single unmarshal
into []any and positional extraction with length guards. Extra trailing fields
from newer vLLM are silently ignored; missing trailing fields from older vLLM
get zero values.
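The length-guard pattern described above can be sketched as follows. The field layout here (tag, block size, optional medium) is a hypothetical positional record for illustration, not the actual vLLM event schema:

```go
package main

import "fmt"

// event is a hypothetical positional record; the real vLLM event schema
// and field order differ.
type event struct {
	Tag       string
	BlockSize int
	Medium    string // trailing field, absent when the producer is older
}

// decodeEvent extracts fields positionally from a []any, the shape a
// msgpack array-decode produces. Length guards make it tolerant in both
// directions: extra trailing fields from a newer producer are ignored,
// missing trailing fields from an older producer get zero values.
func decodeEvent(arr []any) (event, error) {
	var e event
	if len(arr) < 2 {
		return e, fmt.Errorf("event array too short: %d", len(arr))
	}
	tag, ok := arr[0].(string)
	if !ok {
		return e, fmt.Errorf("field 0: want string, got %T", arr[0])
	}
	e.Tag = tag
	size, ok := arr[1].(int64)
	if !ok {
		return e, fmt.Errorf("field 1: want int64, got %T", arr[1])
	}
	e.BlockSize = int(size)
	if len(arr) > 2 { // present only when the producer is new enough
		if m, ok := arr[2].(string); ok {
			e.Medium = m
		}
	}
	return e, nil
}

func main() {
	old, _ := decodeEvent([]any{"block_stored", int64(16)})           // older producer
	cur, _ := decodeEvent([]any{"block_stored", int64(16), "gpu", 7}) // newer producer
	fmt.Println(old.Medium, cur.Medium)
}
```

A single unmarshal into []any plus positional extraction like this avoids the double-decode the commit removed, at the cost of per-field type assertions.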

* test: add decode benchmarks for vLLM event schema compat

* fix: address review — reuse shared helpers, fix lint

- Reuse convertBlockHashes() from common.go in both vLLM converters
- Move convertExtraKeys() back to common.go (shared with SGLang)
- Remove unused engineName param from decodeEvent()
- Fix bench test lint: error checks, paramTypeCombine, appendCombine

* fix: suppress gosec G115 in bench test data construction

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough

Walkthrough

This PR introduces GPUDirect Storage (GDS) support to enable direct GPU-to-file I/O via NVIDIA cuFile. Changes include: (1) dynamic runtime loading of libcufile.so via dlopen with fallback symbol resolution; (2) a new StorageHandler abstract interface with FileIO (CPU-staged) and GdsFileIO (GDS-backed) implementations; (3) mode-dependent storage dispatch in StorageOffloadEngine selecting between CPU and GDS paths; (4) Python binding updates to accept a gds_mode configuration parameter; (5) kvevents adapter refactoring to support msgspec positional decoding for vLLM events; (6) supporting infrastructure including enums (StorageMode, GdsMode), documentation, and tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Security & Design Findings

CWE-427 (Untrusted Search Path): cufile_loader.hpp uses dlopen("libcufile.so") and dlopen("libcufile.so.0") without absolute paths or integrity validation. No constraints on LD_LIBRARY_PATH or library version. Recommend: validate library signature or restrict search to /usr/lib64 or require absolute path via environment variable.

CWE-252 (Unchecked Return Value): dlsym() calls in cufile_loader.hpp resolve function pointers but leave them nullptr on failure without distinguishing partial load failures from total failure. Callers cannot reliably detect which functions are unavailable. Consider explicit per-function error logging at load time.

CWE-367 (Time-of-Check-Time-of-Use): gds_file_io.cpp writes to temporary file then renames atomically (good), but no validation that parent directory creation succeeds before write. std::filesystem::create_directories errors are not explicitly checked.

Design Issue: StorageOffloadEngine constructor now requires gds_mode string that is parsed in spec.py but validated only in worker.py after construction. Prefer early validation at binding time or factory method. String-based mode dispatch in parse_gds_mode() has no exhaustive pattern matching—falling back to DISABLED silently masks typos.
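A strict parser along the lines suggested might look like this. The mode names other than 'disabled' are assumptions here; the real values live in GdsMode in storage_types.hpp:

```python
from enum import Enum


class GdsMode(Enum):
    # Assumed mode names for illustration; match these to storage_types.hpp.
    DISABLED = "disabled"
    READ = "read"
    WRITE = "write"
    READ_WRITE = "read_write"


def parse_gds_mode(value: str) -> GdsMode:
    """Strict parse: unknown strings raise instead of silently falling
    back to DISABLED, so a typo surfaces at configuration time."""
    try:
        return GdsMode(value.strip().lower())
    except ValueError:
        valid = ", ".join(m.value for m in GdsMode)
        raise ValueError(
            f"invalid gds_mode {value!r}; expected one of: {valid}"
        ) from None
```

Raising at parse time (in spec.py or the binding layer) moves the failure from a silent runtime downgrade to an explicit configuration error.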

Architectural Concern: TensorCopier now exposes accessors (get_tensors(), get_block_size()) to enable GdsFileIO initialization, breaking encapsulation. Consider passing these via constructor parameters instead of exposing internal state.

🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check (❓ Inconclusive): The title references a specific upstream commit hash (1c435cb) and date (2026-04-02), indicating a sync operation, but it does not clearly convey the main substantive changes. The PR includes GDS support, vLLM event schema changes, timeout tuning, and golden-test fixes, yet the title hides these behind a generic sync pattern. Resolution: consider expanding the title to highlight the primary feature addition (e.g., '[sync] Add GDS support and vLLM event compatibility updates from upstream') or clarifying the scope beyond the commit hash reference.
✅ Passed checks (1 passed)
  • Description check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp (1)

131-142: ⚠️ Potential issue | 🟠 Major

gds_mode parameter is unused; staging buffer allocated even for full-GDS modes.

The function signature accepts GdsMode gds_mode but never references it. Per the header comment ("0 for full-GDS modes"), this should return 0 when gds_mode indicates direct GDS transfers, avoiding unnecessary memory allocation.

Proposed fix
 size_t StorageOffloadEngine::calc_staging_bytes(
     int gpu_blocks_per_file,
     const std::vector<torch::Tensor>& tensors,
     GdsMode gds_mode) {
+  // Full-GDS modes don't need CPU staging buffers
+  if (gds_mode == GdsMode::READ_WRITE) {
+    return 0;
+  }
   size_t block_size_in_bytes = 0;
   for (const auto& tensor : tensors) {
     block_size_in_bytes += static_cast<size_t>(tensor.stride(0)) *
                            static_cast<size_t>(tensor.element_size());
   }
   return block_size_in_bytes * static_cast<size_t>(gpu_blocks_per_file);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp` around lines
131 - 142, The calc_staging_bytes function currently ignores the gds_mode
parameter and always computes a staging buffer size; update
StorageOffloadEngine::calc_staging_bytes to return 0 when gds_mode indicates
full-GDS/direct GDS transfer mode (per the header note "0 for full-GDS modes")
and only compute the byte-size loop for non-GDS modes; locate the function by
name and add a conditional branch that checks the appropriate GdsMode enum/flag
for full-GDS before performing the tensor stride*element_size accumulation so
staging memory is not allocated for direct-GDS transfers.
🧹 Nitpick comments (12)
pkg/tokenization/uds_tokenizer.go (2)

76-78: Reuse mmTimeout in warmup() for consistency.

Line 200 hardcodes 30*time.Second for the warmup context timeout. Consider using mmTimeout to keep the value in sync.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tokenization/uds_tokenizer.go` around lines 76 - 78, The warmup()
function currently hardcodes a 30*time.Second context timeout; replace that
literal with the mmTimeout constant to keep timeout values consistent. Locate
the context creation in warmup() (e.g., context.WithTimeout(...,
30*time.Second)) and change the timeout argument to mmTimeout so the function
uses the shared mmTimeout variable defined near the top of the file.

284-291: Parent context deadline ignored; gRPC call may outlive caller's timeout.

context.Background() is hardcoded, so if upstream callers (e.g., HTTP handlers, orchestrators) have their own deadlines, this 30s timeout will not respect them. A slow backend could hold resources long after the original request is abandoned.

Consider accepting a parent context.Context parameter and deriving the timeout from it:

ctx, cancel := context.WithTimeout(parentCtx, timeout)

This requires a signature change to RenderChat (and the Tokenizer interface), which may be out of scope for this PR. If so, document this limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tokenization/uds_tokenizer.go` around lines 284 - 291, The RenderChat
call is creating a new context from context.Background(), ignoring upstream
deadlines; change the API to accept a parent context.Context and derive the
timeout from it (use ctx, cancel := context.WithTimeout(parentCtx, timeout)),
update the Tokenizer interface signature (Tokenizer.RenderChat) and all callers
to pass through the caller's context, and ensure cancel() is deferred; if
changing signatures is out of scope, add a clear code comment on RenderChat and
the Tokenizer interface describing this limitation and the potential resource
leak so it is documented for future work.
pkg/kvevents/engineadapter/vllm_adapter.go (1)

293-316: Missing int case in type switch.

toInt handles int8-int64 and uint8-uint64 but not the base int type. While msgpack v5 typically decodes integers as int64/uint64, if any code path passes a Go int, this will fail with a confusing "unsupported numeric type: int" error.

Add int case for completeness
 func toInt(raw any) (int, error) {
 	switch v := raw.(type) {
+	case int:
+		return v, nil
 	case int64:
 		return int(v), nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/kvevents/engineadapter/vllm_adapter.go` around lines 293 - 316, The toInt
function is missing a case for the native Go int type which causes "unsupported
numeric type: int" errors; update the type switch in toInt (function toInt) to
include a case for int that returns int(v), nil (you can add a comment similar
to the uint64 case about overflow not being a concern) so the function handles
int values alongside int8/int16/int32/int64 and the unsigned variants.
kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp (3)

144-149: utimensat return value ignored—silent failure on atime update.

If utimensat fails (e.g., permissions, non-existent file), the error is silently discarded. Consider logging at DEBUG level for troubleshooting.

Proposed fix
 void FileIO::update_atime(const std::string& path) {
   struct timespec times[2];
   times[0].tv_nsec = UTIME_NOW;   // atime → now
   times[1].tv_nsec = UTIME_OMIT;  // mtime → unchanged
-  utimensat(AT_FDCWD, path.c_str(), times, 0);
+  if (utimensat(AT_FDCWD, path.c_str(), times, 0) != 0) {
+    FS_LOG_DEBUG("Failed to update atime for: " << path << " - "
+                                                 << std::strerror(errno));
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp` around
lines 144 - 149, FileIO::update_atime currently calls utimensat and ignores its
return value; change it to check the return value and, on error (return < 0),
log a DEBUG-level message including the path and strerror(errno) (or errno) to
aid troubleshooting; ensure you include context (function
name/FileIO::update_atime and the path) in the log and keep behavior otherwise
unchanged (no exception thrown) so callers remain unaffected.

66-74: pubsetbuf after opening has implementation-defined behavior.

Per C++ standard, calling pubsetbuf after a file is already open may be ignored by some implementations. Move the buffer setup before open() or use the two-argument std::ofstream constructor after setting the buffer.

Proposed fix
-  std::ofstream ofs(tmp_path, std::ios::out | std::ios::binary);
-  if (!ofs) {
+  std::ofstream ofs;
+  ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE);
+  ofs.open(tmp_path, std::ios::out | std::ios::binary);
+  if (!ofs.is_open()) {
     FS_LOG_ERROR("Failed to open temporary file for writing: "
                  << tmp_path << " - " << std::strerror(errno));
     return false;
   }
-
-  // Apply the custom buffer to the file stream
-  ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp` around
lines 66 - 74, The call to ofs.rdbuf()->pubsetbuf(...) occurs after the
std::ofstream is opened which is implementation-defined and may be ignored;
change to set the buffer on the streambuf before opening the file (use a
default-constructed std::ofstream ofs, call
ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE), then call
ofs.open(tmp_path, std::ios::out | std::ios::binary) and check the stream), or
alternatively use a constructor/approach that ensures the buffer is installed
prior to opening; update the code referencing tmp_path, ofs,
thread_write_buffer, and WRITE_BUFFER_SIZE accordingly.

44-45: Potential tmp file collision across processes using same storage path.

std::random_device{}() returns a 32-bit value. While thread-local ensures per-thread uniqueness, multiple processes could generate identical suffixes. Consider including PID:

Proposed fix
+#include <unistd.h>
+
 thread_local std::string tmp_file_suffix =
-    "_" + std::to_string(std::random_device{}()) + ".tmp";
+    "_" + std::to_string(getpid()) + "_" +
+    std::to_string(std::random_device{}()) + ".tmp";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp` around
lines 44 - 45, The thread-local tmp_file_suffix in file_io.cpp uses
std::random_device{}() which can collide across processes; modify construction
of tmp_file_suffix (the thread_local std::string) to include the current process
identifier plus a higher-entropy random value so suffixes are unique across
processes (e.g., concatenate "_" + process id + "_" + a 64-bit RNG value +
".tmp"). Use appropriate API for PID on each platform (getpid() on POSIX or
GetCurrentProcessId() on Windows) or fall back to
std::hash<std::thread::id>()/timestamp as needed, and seed a std::mt19937_64
with std::random_device for the random portion. Ensure the change preserves
thread_local semantics and the variable name tmp_file_suffix remains the same.
kv_connectors/llmd_fs_backend/docs/cufile_rdma.json (1)

69-69: Placeholder values must be replaced before use.

rdma_dev_addr_list contains <RDMA_NIC_IP_1> placeholders. This will cause cuFile parse failures if not replaced. The instructions at lines 4-5 are clear, but consider adding a startup validation in the GDS initialization code to detect and warn about placeholder values.
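A minimal startup check along these lines could reject unreplaced placeholders before cuFile ever parses the file. The properties.rdma_dev_addr_list key path mirrors the template; treat it, and the function name, as assumptions:

```python
import json
import re

# Matches template tokens such as <RDMA_NIC_IP_1> that were never replaced.
_PLACEHOLDER = re.compile(r"<[A-Z0-9_]+>")


def unreplaced_rdma_addrs(config_text: str) -> list[str]:
    """Return rdma_dev_addr_list entries that still look like placeholders.

    A non-empty result should be treated as a fatal misconfiguration
    before cuFile is initialized.
    """
    cfg = json.loads(config_text)
    addrs = cfg.get("properties", {}).get("rdma_dev_addr_list", [])
    return [a for a in addrs if _PLACEHOLDER.search(a)]
```

Failing fast with the offending entries in the error message gives the operator an actionable hint instead of an opaque cuFile parse failure.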

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/docs/cufile_rdma.json` at line 69, The
rdma_dev_addr_list JSON contains placeholder entries like "<RDMA_NIC_IP_1>"
which will break cuFile parsing; add a startup validation in the GDS
initialization path (e.g., the function that performs GDS
initialization/registration such as initGDS/initializeGDS or the routine that
loads cuFile/GDS config) that inspects the rdma_dev_addr_list array and fails
early or logs a clear error if any entry matches the placeholder pattern (e.g.,
contains '<' or matches /<.*>/ or known tokens like "<RDMA_NIC_IP_"),
instructing the user to replace them; make this validation produce a fatal error
or warning with actionable text so misconfigured JSON is detected before cuFile
is invoked.
kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp (1)

54-57: Docstring formatting issue: missing newline before gds_mode description.

Lines 55-56 run read_ratio) directly into gds_mode: without a separating newline.

Proposed fix
            "  read_preferring_workers: Number of workers that check "
-           "  read queue first (calculated as int(io_threads * read_ratio) "
-           "  gds_mode: GDS operation mode (see GdsMode in storage_types.hpp). "
+           "  read queue first (calculated as int(io_threads * read_ratio)).\n"
+           "  gds_mode: GDS operation mode (see GdsMode in storage_types.hpp). "
            "Defaults to 'disabled'.\n")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp`
around lines 54 - 57, The docstring in storage_offload_bindings.cpp concatenates
"read_ratio)" and "gds_mode:" without a newline; update the string literal so
there is a newline (e.g., "\n") before the "gds_mode: GDS operation mode..."
description to separate the paragraphs referring to
read_preferring_workers/read_ratio and gds_mode; locate the multi-line docstring
containing "read_preferring_workers" and "gds_mode" and insert the missing
newline in that literal.
kv_connectors/llmd_fs_backend/docs/gds.md (1)

76-78: Fenced code blocks should specify a language.

Add text or log as the language identifier for example output blocks to satisfy markdown linting and improve rendering.

Proposed fix
-```
+```text
 GdsFileIO: GPUDirect Storage (GDS) enabled

If GDS initialization fails, the connector automatically falls back to CPU staging and logs:

-```
+```text
 StorageOffloadEngine: GDS initialization failed, falling back to CPU_BUFFER_STAGE for both READ and WRITE
 StorageOffloadEngine: READ=CPU WRITE=CPU

Also applies to: 82-85

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/docs/gds.md` around lines 76 - 78, The markdown
example output blocks for GdsFileIO and the StorageOffloadEngine logs are
missing a fenced-code language; update the two fenced blocks containing
"GdsFileIO: GPUDirect Storage (GDS) enabled" and the block with
"StorageOffloadEngine: GDS initialization failed..." / "StorageOffloadEngine:
READ=CPU WRITE=CPU" to use a language identifier (e.g., add ```text or ```log)
so the code fences include a language for proper linting and rendering; locate
the blocks by the unique text snippets "GdsFileIO: GPUDirect Storage (GDS)
enabled" and "StorageOffloadEngine: GDS initialization failed, falling back to
CPU_BUFFER_STAGE" when making the change.
kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp (1)

104-110: Silent failure when dlopen fails; no diagnostic output.

When libcufile.so cannot be loaded, the code returns silently with m_handle = nullptr. Adding dlerror() output would help diagnose missing library paths or dependency issues in production.

Log dlopen failure reason
   CuFileApi() {
     m_handle = dlopen("libcufile.so", RTLD_NOW);
     if (!m_handle) {
       // Try versioned name
       m_handle = dlopen("libcufile.so.0", RTLD_NOW);
     }
-    if (!m_handle) return;
+    if (!m_handle) {
+      // Optionally log: dlerror() contains the failure reason
+      return;
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp`
around lines 104 - 110, The CuFileApi constructor currently returns silently
when dlopen fails (m_handle == nullptr); update CuFileApi to call dlerror() and
log or print the returned error string when both dlopen("libcufile.so",
RTLD_NOW) and dlopen("libcufile.so.0", RTLD_NOW) fail so you get diagnostic
output about why the library couldn't be loaded; locate the CuFileApi
constructor and add a diagnostic message referencing dlerror() (and the
attempted names "libcufile.so" / "libcufile.so.0") before returning so failures
are visible at runtime.
kv_connectors/llmd_fs_backend/tests/test_gds_backend.py (2)

30-34: Exception handling swallows all errors silently.

Catching bare Exception and returning False loses diagnostic information. If lsmod or ldconfig fails for unexpected reasons (e.g., permissions), the failure reason is hidden.

Log the exception for debugging
     try:
         result = subprocess.run(["lsmod"], capture_output=True, text=True, check=False)
         nvidia_fs_loaded = "nvidia_fs" in result.stdout
-    except Exception:
+    except Exception as e:
+        # Log for debugging; proceed without nvidia_fs detection
+        logging.warning("lsmod check failed: %s", e)  # requires `import logging`
         nvidia_fs_loaded = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/tests/test_gds_backend.py` around lines 30 -
34, The test currently swallows all errors in the try/except around the
subprocess.run call that sets nvidia_fs_loaded, making failures silent; update
the except block in the same try that invokes subprocess.run(["lsmod"]) to log
the caught exception (include the exception details/traceback) before setting
nvidia_fs_loaded = False — use the project logger or Python's logging with
logger.exception() or logging.error(..., exc_info=True) so the failure reason is
recorded for debugging while preserving the fallback behavior.

60-73: Minor inefficiency: duplicate status collection.

check_gds_available() and get_gds_status_message() both call _collect_gds_status(). In test_gds_available() (lines 81-82), both are invoked, resulting in redundant subprocess calls. Consider returning status from check_gds_available() or caching the result.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/tests/test_gds_backend.py` around lines 60 -
73, The two functions call _collect_gds_status() twice causing redundant work;
change check_gds_available() to return the collected status dict (or store it in
a module-level cached variable) instead of just a bool, then update
get_gds_status_message() and test_gds_available() to reuse that returned status
(or read the cached value) so _collect_gds_status() is invoked only once;
reference the functions check_gds_available, get_gds_status_message, and
_collect_gds_status when making the changes and adjust test_gds_available() to
accept the new return shape or use the cache.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp`:
- Around line 40-58: The CUfileDrvProps_t definition here mismatches the
official NVIDIA cuFile ABI and can corrupt memory when
cuFileDriverGetProperties() populates it; replace the local struct with the
exact definition from NVIDIA's cuFile headers (or include the vendor header
directly) so fields like the nested nvfs struct, CUfileFeatureFlags_t fflags,
unsigned int-typed fields (e.g., max_pinned_memory_size, per_buffer_cache_size),
and any optional max_batch_io_timeout_msecs are present and correctly typed and
ordered; keep CUfileDescr_t as-is but remove or replace the _reserved[256]
placeholder and ensure CUfileDrvProps_t matches the vendor header exactly.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`:
- Around line 215-228: The loop that registers GPU buffer chunks (using
num_chunks, chunk_size, size, block_ptr and calling m_cufile.cuFileBufRegister)
can overread when size isn't a multiple of chunk_size; change the registration
to compute remaining = size - (i * chunk_size) and set register_len =
std::min(chunk_size, remaining), call cuFileBufRegister with register_len (not
chunk_size) and store register_len in m_registered_buffers[block_ptr] so the
last block is clamped to the actual remaining bytes.
- Around line 32-34: The thread-local temporary suffix gds_tmp_suffix is
vulnerable to collisions because it uses only std::random_device{}() (32-bit);
change its generation to include the calling thread's identity and stronger
uniqueness (for example combine std::this_thread::get_id() and a per-thread or
global atomic counter or a larger random value) so each thread produces a
distinct suffix; update the thread_local gds_tmp_suffix initialization (the
symbol gds_tmp_suffix in gds_file_io.cpp) to concatenate thread id and an added
unique token to avoid temp file collisions when writing the same file_path.
- Around line 259-266: The current use of O_DIRECT when opening the temp file
(open(..., O_DIRECT)) can cause EINVAL in cuFileRead/cuFileWrite if
tensor.data_ptr() + (gpu_block_idx * block_size) or block_size is not aligned to
filesystem block boundaries; update the Open+IO path in GdsFileIO so it first
verifies alignment (ensure block_size and computed actual_gpu_ptr are aligned to
512/4096 as required) and if not aligned or if open with O_DIRECT or the cuFile
calls return EINVAL, retry the operation by reopening the file without O_DIRECT
and performing standard buffered I/O as a fallback; add explicit checks around
the arithmetic that computes actual_gpu_ptr (referencing tensor.data_ptr() and
gpu_block_idx) and improve error logging in the cuFileWrite/cuFileRead error
handlers to distinguish alignment errors vs other errno values so the code can
decide to retry without O_DIRECT.
- Around line 340-346: GdsFileIO::read_blocks_from_file currently ignores the
provided cudaStream_t and performs cuFileRead without GPU synchronization;
update this function to synchronize GPU work on the target device buffers
before/after the cuFileRead (mirror the write path and FileIO behavior) by
calling cudaStreamWaitEvent(stream, gpu_kvs_ready_event) before issuing the read
or calling cudaStreamSynchronize(stream) after the read completes (or both if
needed) to prevent concurrent kernel access to partially-written data; reference
GdsFileIO::read_blocks_from_file, gpu_kvs_ready_event, cudaStreamWaitEvent,
cudaStreamSynchronize, async_store_gpu_blocks, and FileIO::read_blocks_from_file
when making the change.

In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp`:
- Around line 310-327: The read path in storage_offload.cpp does not update
access time after a successful GDS read, which breaks LRU eviction; after a
successful TIME_EXPR_THROUGHPUT call that sets success from
m_read_handler->read_blocks_from_file (the block read branch in the function
where src_file and block_ids are used), call the FileIO::update_atime for the
same src_file (or invoke the appropriate update_atime method on the file IO
implementation such as GdsFileIO) before returning; ensure the update is
performed only when success is true and handle/log any update failures similarly
to the write path's update_atime usage so atime is kept in sync for LRU
eviction.

In `@kv_connectors/llmd_fs_backend/docs/cufile_rdma.json`:
- Around line 1-12: The cufile_rdma.json template contains // comments which
make it invalid JSON for cuFile; update the file cufile_rdma.json so it is
strict JSON with all comments removed (leave only JSON key/value entries), or
instead convert this file into a .md (documentation) and add a separate valid
cufile_rdma.json without comments; ensure any references to CUFILE_ENV_PATH_JSON
in the content remain unchanged and the final JSON parses cleanly with no //
comment lines.

In `@kv_connectors/llmd_fs_backend/docs/gds.md`:
- Line 8: The link fragment in gds.md uses the wrong anchor
`#tuning-cufile-cufilejson`; update it to match the generated GitHub anchor for
the heading `## Tuning cuFile (`cufile_rdma.json`)` by changing the fragment to
`#tuning-cufile-cufile_rdmajson` (or alternatively rename the heading to remove
the underscore to match the existing link). Locate the incorrect fragment string
in the link and replace it with the corrected anchor
`#tuning-cufile-cufile_rdmajson` (or adjust the heading text `Tuning cuFile
(cufile_rdma.json)`) so the link resolves.

In `@kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py`:
- Line 253: The constructor for the worker class (the __init__ method in
llmd_fs_backend/worker.py) makes gds_mode a required parameter which breaks
callers; change the signature to give gds_mode a default of "disabled" (e.g.,
gds_mode: str = "disabled") so callers are backward compatible and consistent
with the C++ binding default, and update any internal references or docstrings
that assume a provided value if necessary.

---

Outside diff comments:
In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp`:
- Around line 131-142: The calc_staging_bytes function currently ignores the
gds_mode parameter and always computes a staging buffer size; update
StorageOffloadEngine::calc_staging_bytes to return 0 when gds_mode indicates
full-GDS/direct GDS transfer mode (per the header note "0 for full-GDS modes")
and only compute the byte-size loop for non-GDS modes; locate the function by
name and add a conditional branch that checks the appropriate GdsMode enum/flag
for full-GDS before performing the tensor stride*element_size accumulation so
staging memory is not allocated for direct-GDS transfers.

---

Nitpick comments:
In
`@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp`:
- Around line 104-110: The CuFileApi constructor currently returns silently when
dlopen fails (m_handle == nullptr); update CuFileApi to call dlerror() and log
or print the returned error string when both dlopen("libcufile.so", RTLD_NOW)
and dlopen("libcufile.so.0", RTLD_NOW) fail so you get diagnostic output about
why the library couldn't be loaded; locate the CuFileApi constructor and add a
diagnostic message referencing dlerror() (and the attempted names "libcufile.so"
/ "libcufile.so.0") before returning so failures are visible at runtime.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp`:
- Around line 144-149: FileIO::update_atime currently calls utimensat and
ignores its return value; change it to check the return value and, on error
(return < 0), log a DEBUG-level message including the path and strerror(errno)
(or errno) to aid troubleshooting; ensure you include context (function
name/FileIO::update_atime and the path) in the log and keep behavior otherwise
unchanged (no exception thrown) so callers remain unaffected.
- Around line 66-74: The call to ofs.rdbuf()->pubsetbuf(...) occurs after the
std::ofstream is opened which is implementation-defined and may be ignored;
change to set the buffer on the streambuf before opening the file (use a
default-constructed std::ofstream ofs, call
ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE), then call
ofs.open(tmp_path, std::ios::out | std::ios::binary) and check the stream), or
alternatively use a constructor/approach that ensures the buffer is installed
prior to opening; update the code referencing tmp_path, ofs,
thread_write_buffer, and WRITE_BUFFER_SIZE accordingly.
- Around line 44-45: The thread-local tmp_file_suffix in file_io.cpp uses
std::random_device{}() which can collide across processes; modify construction
of tmp_file_suffix (the thread_local std::string) to include the current process
identifier plus a higher-entropy random value so suffixes are unique across
processes (e.g., concatenate "_" + process id + "_" + a 64-bit RNG value +
".tmp"). Use appropriate API for PID on each platform (getpid() on POSIX or
GetCurrentProcessId() on Windows) or fall back to
std::hash<std::thread::id>()/timestamp as needed, and seed a std::mt19937_64
with std::random_device for the random portion. Ensure the change preserves
thread_local semantics and the variable name tmp_file_suffix remains the same.
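The update_atime suggestion above (first bullet) can be sketched as follows; std::cerr stands in for the project's DEBUG-level logger, and the function name is a local stand-in for FileIO::update_atime:

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <iostream>
#include <sys/stat.h>

// Check utimensat's return value and log on failure; still no exception,
// so callers remain unaffected.
bool update_atime_logged(const char* path) {
  struct timespec times[2];
  times[0].tv_sec = 0;
  times[0].tv_nsec = UTIME_NOW;   // bump atime to now
  times[1].tv_sec = 0;
  times[1].tv_nsec = UTIME_OMIT;  // leave mtime untouched
  if (utimensat(AT_FDCWD, path, times, 0) < 0) {
    std::cerr << "FileIO::update_atime: utimensat failed for " << path
              << ": " << std::strerror(errno) << " (errno=" << errno << ")\n";
    return false;
  }
  return true;
}
```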
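The pubsetbuf ordering fix (second bullet) looks like this in isolation; WRITE_BUFFER_SIZE and thread_write_buffer mirror the names in file_io.cpp but are local stand-ins here:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

constexpr std::size_t WRITE_BUFFER_SIZE = 1 << 20;

// Install the write buffer on the streambuf *before* open(): calling
// pubsetbuf() on an already-open stream is implementation-defined and
// may be silently ignored.
bool write_with_prebuffered_stream(const std::string& tmp_path,
                                   const char* data, std::size_t len) {
  thread_local std::vector<char> thread_write_buffer(WRITE_BUFFER_SIZE);
  std::ofstream ofs;  // default-constructed: not yet open
  ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE);
  ofs.open(tmp_path, std::ios::out | std::ios::binary);
  if (!ofs) {
    return false;
  }
  ofs.write(data, static_cast<std::streamsize>(len));
  return static_cast<bool>(ofs);
}
```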
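The cross-process-unique suffix (third bullet) might look like this on POSIX; the variable keeps the name tmp_file_suffix and its thread_local semantics, and GetCurrentProcessId() would replace getpid() on Windows:

```cpp
#include <cassert>
#include <random>
#include <string>
#include <unistd.h>  // getpid()

// "_<pid>_<64-bit random>.tmp": PID disambiguates processes, the
// mt19937_64 value (seeded from std::random_device) disambiguates threads.
static std::string make_tmp_file_suffix() {
  std::mt19937_64 rng{std::random_device{}()};
  return "_" + std::to_string(getpid()) + "_" + std::to_string(rng()) + ".tmp";
}

thread_local std::string tmp_file_suffix = make_tmp_file_suffix();
```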

In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp`:
- Around line 54-57: The docstring in storage_offload_bindings.cpp concatenates
"read_ratio)" and "gds_mode:" without a newline; update the string literal so
there is a newline (e.g., "\n") before the "gds_mode: GDS operation mode..."
description to separate the paragraphs referring to
read_preferring_workers/read_ratio and gds_mode; locate the multi-line docstring
containing "read_preferring_workers" and "gds_mode" and insert the missing
newline in that literal.

In `@kv_connectors/llmd_fs_backend/docs/cufile_rdma.json`:
- Line 69: The rdma_dev_addr_list JSON contains placeholder entries like
"<RDMA_NIC_IP_1>" which will break cuFile parsing; add a startup validation in
the GDS initialization path (e.g., the function that performs GDS
initialization/registration such as initGDS/initializeGDS or the routine that
loads cuFile/GDS config) that inspects the rdma_dev_addr_list array and fails
early or logs a clear error if any entry matches the placeholder pattern (e.g.,
contains '<' or matches /<.*>/ or known tokens like "<RDMA_NIC_IP_"),
instructing the user to replace them; make this validation produce a fatal error
or warning with actionable text so misconfigured JSON is detected before cuFile
is invoked.
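A sketch of that startup validation; the function name and the idea of receiving the parsed rdma_dev_addr_list entries are assumptions (JSON parsing itself is omitted):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Fail fast with actionable text if any entry still looks like a
// "<RDMA_NIC_IP_1>"-style placeholder (angle brackets survive only in
// unedited template values).
void validate_rdma_dev_addr_list(const std::vector<std::string>& addrs) {
  for (const auto& addr : addrs) {
    if (addr.find('<') != std::string::npos ||
        addr.find('>') != std::string::npos) {
      throw std::runtime_error(
          "cufile_rdma.json: rdma_dev_addr_list entry '" + addr +
          "' looks like a placeholder; replace it with a real RDMA NIC IP "
          "before enabling GDS");
    }
  }
}
```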

In `@kv_connectors/llmd_fs_backend/docs/gds.md`:
- Around line 76-78: The markdown example output blocks for GdsFileIO and the
StorageOffloadEngine logs are missing a fenced-code language; update the two
fenced blocks containing "GdsFileIO: GPUDirect Storage (GDS) enabled" and the
block with "StorageOffloadEngine: GDS initialization failed..." /
"StorageOffloadEngine: READ=CPU WRITE=CPU" to use a language identifier (e.g.,
add ```text or ```log) so the code fences include a language for proper linting
and rendering; locate the blocks by the unique text snippets "GdsFileIO:
GPUDirect Storage (GDS) enabled" and "StorageOffloadEngine: GDS initialization
failed, falling back to CPU_BUFFER_STAGE" when making the change.

In `@kv_connectors/llmd_fs_backend/tests/test_gds_backend.py`:
- Around line 30-34: The test currently swallows all errors in the try/except
around the subprocess.run call that sets nvidia_fs_loaded, making failures
silent; update the except block in the same try that invokes
subprocess.run(["lsmod"]) to log the caught exception (include the exception
details/traceback) before setting nvidia_fs_loaded = False — use the project
logger or Python's logging with logger.exception() or logging.error(...,
exc_info=True) so the failure reason is recorded for debugging while preserving
the fallback behavior.
- Around line 60-73: The two functions call _collect_gds_status() twice causing
redundant work; change check_gds_available() to return the collected status dict
(or store it in a module-level cached variable) instead of just a bool, then
update get_gds_status_message() and test_gds_available() to reuse that returned
status (or read the cached value) so _collect_gds_status() is invoked only once;
reference the functions check_gds_available, get_gds_status_message, and
_collect_gds_status when making the changes and adjust test_gds_available() to
accept the new return shape or use the cache.
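The logging fix from the first bullet can be sketched as below; the function name is a stand-in for the code that sets nvidia_fs_loaded in test_gds_backend.py:

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def detect_nvidia_fs() -> bool:
    """Return True if the nvidia_fs kernel module appears in lsmod output."""
    try:
        result = subprocess.run(
            ["lsmod"], capture_output=True, text=True, check=True
        )
        return "nvidia_fs" in result.stdout
    except Exception:
        # logger.exception() records the traceback of the active exception,
        # so the failure reason is preserved while keeping the fallback.
        logger.exception("lsmod failed; assuming nvidia_fs is not loaded")
        return False
```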
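And the caching suggestion from the second bullet, with the status collection stubbed out (the real _collect_gds_status() inspects the driver; the dict contents here are illustrative):

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def _collect_gds_status() -> dict:
    # Stub: the real implementation probes the GDS driver. lru_cache ensures
    # this runs only once regardless of how many callers need the status.
    return {"available": False, "reason": "nvidia_fs module not loaded"}


def check_gds_available() -> dict:
    # Return the full status dict so callers can reuse it.
    return _collect_gds_status()


def get_gds_status_message() -> str:
    status = check_gds_available()
    return "GDS available" if status["available"] else status["reason"]
```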

In `@pkg/kvevents/engineadapter/vllm_adapter.go`:
- Around line 293-316: The toInt function is missing a case for the native Go
int type which causes "unsupported numeric type: int" errors; update the type
switch in toInt (function toInt) to include a case for int that returns int(v),
nil (you can add a comment similar to the uint64 case about overflow not being a
concern) so the function handles int values alongside int8/int16/int32/int64 and
the unsigned variants.
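A sketch of the extended type switch; the real toInt in vllm_adapter.go handles the full signed/unsigned set, so only the representative cases are shown:

```go
package main

import "fmt"

// toInt converts common msgpack-decoded numeric types to int, including
// the native int case the review asks for.
func toInt(v any) (int, error) {
	switch v := v.(type) {
	case int:
		return v, nil // native Go int
	case int8:
		return int(v), nil
	case int16:
		return int(v), nil
	case int32:
		return int(v), nil
	case int64:
		return int(v), nil
	case uint64:
		// Overflow is not a concern for the values seen here.
		return int(v), nil
	default:
		return 0, fmt.Errorf("unsupported numeric type: %T", v)
	}
}
```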

In `@pkg/tokenization/uds_tokenizer.go`:
- Around line 76-78: The warmup() function currently hardcodes a 30*time.Second
context timeout; replace that literal with the mmTimeout constant to keep
timeout values consistent. Locate the context creation in warmup() (e.g.,
context.WithTimeout(..., 30*time.Second)) and change the timeout argument to
mmTimeout so the function uses the shared mmTimeout variable defined near the
top of the file.
- Around line 284-291: The RenderChat call is creating a new context from
context.Background(), ignoring upstream deadlines; change the API to accept a
parent context.Context and derive the timeout from it (use ctx, cancel :=
context.WithTimeout(parentCtx, timeout)), update the Tokenizer interface
signature (Tokenizer.RenderChat) and all callers to pass through the caller's
context, and ensure cancel() is deferred; if changing signatures is out of
scope, add a clear code comment on RenderChat and the Tokenizer interface
describing this limitation and the potential resource leak so it is documented
for future work.



📥 Commits

Reviewing files that changed from the base of the PR and between 4b96290 and 81290a7.

📒 Files selected for processing (33)
  • .gitignore
  • kv_connectors/llmd_fs_backend/Makefile
  • kv_connectors/llmd_fs_backend/README.md
  • kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp
  • kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp
  • kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/file_io.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/logger.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/storage_handler.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp
  • kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp
  • kv_connectors/llmd_fs_backend/csrc/storage/storage_types.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.hpp
  • kv_connectors/llmd_fs_backend/csrc/storage/thread_pool.cpp
  • kv_connectors/llmd_fs_backend/docs/cufile_rdma.json
  • kv_connectors/llmd_fs_backend/docs/gds.md
  • kv_connectors/llmd_fs_backend/llmd_fs_backend/spec.py
  • kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py
  • kv_connectors/llmd_fs_backend/setup.py
  • kv_connectors/llmd_fs_backend/tests/conftest.py
  • kv_connectors/llmd_fs_backend/tests/test_fs_backend.py
  • kv_connectors/llmd_fs_backend/tests/test_gds_backend.py
  • pkg/kvevents/engineadapter/common.go
  • pkg/kvevents/engineadapter/sglang_adapter.go
  • pkg/kvevents/engineadapter/sglang_adapter_test.go
  • pkg/kvevents/engineadapter/vllm_adapter.go
  • pkg/kvevents/engineadapter/vllm_adapter_bench_test.go
  • pkg/kvevents/engineadapter/vllm_adapter_test.go
  • pkg/tokenization/uds_tokenizer.go
  • tests/e2e/uds_tokenizer/uds_e2e_test.go
💤 Files with no reviewable changes (1)
  • kv_connectors/llmd_fs_backend/csrc/storage/file_io.hpp

Comment on lines +40 to +58
// File descriptor passed to cuFileHandleRegister (layout must match cufile.h)
struct CUfileDescr_t {
  CUfileFileHandleType type;
  union {
    int fd;
    void* handle;
  } handle;
  const CUfileFSOps_t* fs_ops;
};

// Opaque handle returned by cuFileHandleRegister
typedef void* CUfileHandle_t;

// Driver properties returned by cuFileDriverGetProperties
struct CUfileDrvProps_t {
  size_t max_device_cache_size;
  size_t max_device_pinned_mem_size;
  char _reserved[256];
};

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

What is the struct layout for CUfileDescr_t and CUfileDrvProps_t in NVIDIA cuFile library version 1.x?

💡 Result:

The struct layouts for CUfileDescr_t and CUfileDrvProps_t in NVIDIA cuFile library version 1.x, as documented in official NVIDIA GPUDirect Storage cuFile API references (across CUDA 11.4-11.8 and current GDS v1.16 docs), are as follows:

CUfileDescr_t:

    typedef struct CUfileDescr_t {
        CUfileFileHandleType type;      /* type of file being registered */
        union {
            int fd;                     /* Linux */
            void* handle;               /* Windows */
        } handle;
        const CUfileFSOps_t* fs_ops;    /* file system operation table */
    } CUfileDescr_t;

CUfileDrvProps_t:

    typedef struct CUfileDrvProps {
        struct {
            unsigned int major_version;
            unsigned int minor_version;
            size_t poll_thresh_size;
            size_t max_direct_io_size;
            unsigned int dstatusflags;
            unsigned int dcontrolflags;
        } nvfs;
        CUfileFeatureFlags_t fflags;
        unsigned int max_device_cache_size;
        unsigned int per_buffer_cache_size;
        unsigned int max_pinned_memory_size;
        unsigned int max_batch_io_timeout_msecs; /* present in some versions, e.g. CUDA 11.6 */
    } CUfileDrvProps_t;

Note: Minor variations exist across exact versions (e.g., max_batch_io_timeout_msecs appears in CUDA 11.6+ docs), but the core layout is consistent. The structure is self-describing per cuFileDriverGetProperties documentation. Layouts match across PDF docs, HTML API guides, and manpages for libcufile-dev 1.5+.


Critical ABI mismatch in CUfileDrvProps_t struct definition—will cause memory corruption.

The CUfileDrvProps_t struct defined in this file does not match NVIDIA's official cuFile API definition. According to NVIDIA's cuFile documentation (CUDA 11.4–11.8, GDS v1.16), the actual struct contains:

  • A nested nvfs struct with version and configuration fields
  • CUfileFeatureFlags_t fflags
  • unsigned int field types (not size_t)
  • Fields named max_pinned_memory_size and per_buffer_cache_size (not max_device_pinned_mem_size)
  • Optional max_batch_io_timeout_msecs in newer versions

The code's simplified definition with _reserved[256] padding will misalign with the actual library structure, causing memory corruption when cuFileDriverGetProperties() writes to it. Fix struct definition to match NVIDIA's headers exactly, or use the official NVIDIA-provided header directly instead of recreating it. CUfileDescr_t is correctly defined.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp`
around lines 40 - 58, The CUfileDrvProps_t definition here mismatches the
official NVIDIA cuFile ABI and can corrupt memory when
cuFileDriverGetProperties() populates it; replace the local struct with the
exact definition from NVIDIA's cuFile headers (or include the vendor header
directly) so fields like the nested nvfs struct, CUfileFeatureFlags_t fflags,
unsigned int-typed fields (e.g., max_pinned_memory_size, per_buffer_cache_size),
and any optional max_batch_io_timeout_msecs are present and correctly typed and
ordered; keep CUfileDescr_t as-is but remove or replace the _reserved[256]
placeholder and ensure CUfileDrvProps_t matches the vendor header exactly.

Comment on lines +32 to +34
// Thread-local unique suffix for temporary files (matches file_io.cpp pattern)
thread_local std::string gds_tmp_suffix =
"_" + std::to_string(std::random_device{}()) + ".tmp";

⚠️ Potential issue | 🟡 Minor

Thread-local temp suffix may collide across threads.

std::random_device{}() returns a 32-bit value. With many threads, birthday paradox makes collisions more likely. If two threads get the same suffix and write to the same file_path, they'll use the same temp file, causing data corruption or race conditions.

Include thread ID for uniqueness
-thread_local std::string gds_tmp_suffix =
-    "_" + std::to_string(std::random_device{}()) + ".tmp";
+thread_local std::string gds_tmp_suffix =
+    "_" + std::to_string(std::random_device{}()) + "_" +
+    std::to_string(std::hash<std::thread::id>{}(std::this_thread::get_id())) +
+    ".tmp";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`
around lines 32 - 34, The thread-local temporary suffix gds_tmp_suffix is
vulnerable to collisions because it uses only std::random_device{}() (32-bit);
change its generation to include the calling thread's identity and stronger
uniqueness (for example combine std::this_thread::get_id() and a per-thread or
global atomic counter or a larger random value) so each thread produces a
distinct suffix; update the thread_local gds_tmp_suffix initialization (the
symbol gds_tmp_suffix in gds_file_io.cpp) to concatenate thread id and an added
unique token to avoid temp file collisions when writing the same file_path.

Comment on lines +215 to +228
for (size_t i = 0; i < num_chunks; i++) {
void* block_ptr = static_cast<uint8_t*>(gpu_ptr) + (i * chunk_size);

CUfileError_t status = m_cufile.cuFileBufRegister(block_ptr,
chunk_size,
CU_FILE_RDMA_REGISTER);
if (status.err != CU_FILE_SUCCESS) {
FS_LOG_WARN("GdsFileIO: cuFileBufRegister failed for block "
<< i << " with error code: " << status.err);
return false;
}

m_registered_buffers[block_ptr] = chunk_size;
}

⚠️ Potential issue | 🔴 Critical

Buffer overread in chunked registration when size is not a multiple of chunk_size.

When size is not evenly divisible by chunk_size, the final iteration registers chunk_size bytes starting at block_ptr, but only size - (i * chunk_size) bytes remain. This causes cuFileBufRegister to access memory beyond the buffer boundary (undefined behavior).

Fix: clamp the last chunk size
   for (size_t i = 0; i < num_chunks; i++) {
     void* block_ptr = static_cast<uint8_t*>(gpu_ptr) + (i * chunk_size);
+    size_t this_chunk_size = std::min(chunk_size, size - (i * chunk_size));

     CUfileError_t status = m_cufile.cuFileBufRegister(block_ptr,
-                                                      chunk_size,
+                                                      this_chunk_size,
                                                       CU_FILE_RDMA_REGISTER);
     if (status.err != CU_FILE_SUCCESS) {
       FS_LOG_WARN("GdsFileIO: cuFileBufRegister failed for block "
                   << i << " with error code: " << status.err);
       return false;
     }

-    m_registered_buffers[block_ptr] = chunk_size;
+    m_registered_buffers[block_ptr] = this_chunk_size;
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

    for (size_t i = 0; i < num_chunks; i++) {
      void* block_ptr = static_cast<uint8_t*>(gpu_ptr) + (i * chunk_size);
      CUfileError_t status = m_cufile.cuFileBufRegister(block_ptr,
                                                        chunk_size,
                                                        CU_FILE_RDMA_REGISTER);
      if (status.err != CU_FILE_SUCCESS) {
        FS_LOG_WARN("GdsFileIO: cuFileBufRegister failed for block "
                    << i << " with error code: " << status.err);
        return false;
      }
      m_registered_buffers[block_ptr] = chunk_size;
    }

After:

    for (size_t i = 0; i < num_chunks; i++) {
      void* block_ptr = static_cast<uint8_t*>(gpu_ptr) + (i * chunk_size);
      size_t this_chunk_size = std::min(chunk_size, size - (i * chunk_size));
      CUfileError_t status = m_cufile.cuFileBufRegister(block_ptr,
                                                        this_chunk_size,
                                                        CU_FILE_RDMA_REGISTER);
      if (status.err != CU_FILE_SUCCESS) {
        FS_LOG_WARN("GdsFileIO: cuFileBufRegister failed for block "
                    << i << " with error code: " << status.err);
        return false;
      }
      m_registered_buffers[block_ptr] = this_chunk_size;
    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`
around lines 215 - 228, The loop that registers GPU buffer chunks (using
num_chunks, chunk_size, size, block_ptr and calling m_cufile.cuFileBufRegister)
can overread when size isn't a multiple of chunk_size; change the registration
to compute remaining = size - (i * chunk_size) and set register_len =
std::min(chunk_size, remaining), call cuFileBufRegister with register_len (not
chunk_size) and store register_len in m_registered_buffers[block_ptr] so the
last block is clamped to the actual remaining bytes.

Comment on lines +259 to +266
// O_RDWR required by cuFile for internal DMA setup even on write-only paths
int fd = open(tmp_path.c_str(), O_RDWR | O_CREAT | O_DIRECT, 0644);
if (fd < 0) {
FS_LOG_ERROR("GdsFileIO: Failed to open temporary file "
<< tmp_path << ": " << std::strerror(errno)
<< " (errno=" << errno << ")");
return false;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if cuFile documentation specifies alignment requirements for O_DIRECT
rg -n -i "align|O_DIRECT" --type=cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 357


🏁 Script executed:

#!/bin/bash
# Read the function context around lines 259-266
sed -n '240,280p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1624


🏁 Script executed:

#!/bin/bash
# Search for cuFileWrite and cuFileRead calls to see how they use file descriptors
rg -n -A 5 "cuFileWrite|cuFileRead" --type=cpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 7682


🏁 Script executed:

#!/bin/bash
# Check for any alignment-related code or buffers
rg -n -i "alignment|aligned|align" --type=cpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

#!/bin/bash
# Find where actual_gpu_ptr is defined/set
rg -n "actual_gpu_ptr" --type=cpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 576


🏁 Script executed:

#!/bin/bash
# Check the tensor_copier to see if it guarantees any alignment
rg -n -i "tensor_copier|get_tensors" --type=cpp -A 3 kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/ | head -60

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 3556


🏁 Script executed:

#!/bin/bash
# Search for any cuFile documentation or alignment mentions in header files
rg -n -i "align|buffer.*requirement|DMA" --type=hpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 100


🏁 Script executed:

#!/bin/bash
# View the actual_gpu_ptr assignment with context
sed -n '285,310p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1203


🏁 Script executed:

#!/bin/bash
# View the read-side assignment
sed -n '377,405p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1274


🏁 Script executed:

#!/bin/bash
# Find and examine tensor_copier.hpp
find . -name "tensor_copier.hpp" -o -name "tensor_copier.h" | head -5

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 135


🏁 Script executed:

#!/bin/bash
# Search for cuFileBufRegister to see if GPU buffers are registered with alignment info
rg -n "cuFileBufRegister" --type=cpp -B 2 -A 5

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 6897


🏁 Script executed:

#!/bin/bash
# View tensor_copier.hpp
cat kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.hpp | head -100

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 2196


🏁 Script executed:

#!/bin/bash
# Search for error code handling or EINVAL
rg -n "EINVAL|error.*code|errno.*22" --type=cpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1146


🏁 Script executed:

#!/bin/bash
# Look for block_size documentation or constraints
rg -n -B 2 -A 2 "block_size" kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.hpp | head -40

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 433


Verify block_size alignment or implement fallback to non-O_DIRECT I/O.

O_DIRECT requires buffers aligned to filesystem block boundaries (512 bytes or 4KB). Line 293 calculates actual_gpu_ptr = tensor.data_ptr() + (gpu_block_idx * block_size). If block_size is not divisible by 512 or 4K, this arithmetic produces misaligned pointers, causing cuFileWrite/cuFileRead to fail with EINVAL on certain filesystems. Additionally, error handling (lines 304-307, 396-399) logs generic error codes without distinguishing alignment failures, preventing graceful degradation. Either assert that block_size meets alignment constraints, or implement fallback to standard I/O when O_DIRECT fails.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`
around lines 259 - 266, The current use of O_DIRECT when opening the temp file
(open(..., O_DIRECT)) can cause EINVAL in cuFileRead/cuFileWrite if
tensor.data_ptr() + (gpu_block_idx * block_size) or block_size is not aligned to
filesystem block boundaries; update the Open+IO path in GdsFileIO so it first
verifies alignment (ensure block_size and computed actual_gpu_ptr are aligned to
512/4096 as required) and if not aligned or if open with O_DIRECT or the cuFile
calls return EINVAL, retry the operation by reopening the file without O_DIRECT
and performing standard buffered I/O as a fallback; add explicit checks around
the arithmetic that computes actual_gpu_ptr (referencing tensor.data_ptr() and
gpu_block_idx) and improve error logging in the cuFileWrite/cuFileRead error
handlers to distinguish alignment errors vs other errno values so the code can
decide to retry without O_DIRECT.
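The alignment pre-check suggested above can be sketched without any CUDA or cuFile dependency; kDirectIoAlignment and the function name are stand-ins, and 4096 is the conservative bound (any 4 KiB-aligned value is also 512-byte aligned):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kDirectIoAlignment = 4096;

// Verify both the buffer pointer and block_size meet O_DIRECT alignment
// constraints before opening with O_DIRECT; on failure the caller would
// reopen without O_DIRECT and fall back to buffered I/O.
inline bool is_direct_io_aligned(const void* ptr, std::size_t block_size) {
  auto addr = reinterpret_cast<std::uintptr_t>(ptr);
  return addr % kDirectIoAlignment == 0 &&
         block_size % kDirectIoAlignment == 0;
}
```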

Comment on lines +340 to +346
// StorageHandler interface: Read blocks from file
bool GdsFileIO::read_blocks_from_file(const std::string& file_path,
const std::vector<int64_t>& block_ids,
cudaStream_t stream) {
// Each ThreadPool thread has its own CUDA stream, but cuFileRead is
// synchronous and operates directly on the device — no stream needed.
(void)stream;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how async_load_gpu_blocks callers synchronize
ast-grep --pattern 'async_load_gpu_blocks($$$)'
rg -n -B10 -A5 'async_load_gpu_blocks'

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 6845


🏁 Script executed:

# Get the full implementation of async_load_gpu_blocks to understand synchronization
sed -n '286,350p' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 2442


🏁 Script executed:

# Also get the write path implementation for comparison
sed -n '210,280p' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 3062


🏁 Script executed:

# Check for synchronization calls related to gpu_kvs_ready_event
rg -n 'gpu_kvs_ready_event' kv_connectors/llmd_fs_backend/csrc/storage/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 468


🏁 Script executed:

# Check how read_blocks_from_file is called within async_load_gpu_blocks
rg -n 'read_blocks_from_file' kv_connectors/llmd_fs_backend/csrc/storage/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1184


🏁 Script executed:

# Check the wait_job and get_finished implementations
sed -n '130,180p' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1659


🏁 Script executed:

# Check how Python caller waits after async_load_gpu_blocks
sed -n '230,260p' kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1128


🏁 Script executed:

# Check get_finished implementation
rg -n 'get_finished' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp -A10

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 439


🏁 Script executed:

# Check if ThreadPool's TLS stream is related to CUDA default stream
rg -n 'get_tls_stream\|ThreadPool' kv_connectors/llmd_fs_backend/csrc/storage/ -A5 | head -80

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

# Get more context on wait_job implementation
sed -n '170,200p' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 802


🏁 Script executed:

# Check GdsFileIO read_blocks_from_file full implementation
sed -n '340,400p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 2520


🏁 Script executed:

# Check if there's stream synchronization after async jobs
rg -n 'cudaStreamSynchronize\|cudaStreamWaitEvent' kv_connectors/llmd_fs_backend/csrc/storage/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

# Check how the loaded blocks are accessed in Python after recording job
rg -n '_record_job\|wait_job\|get_finished' kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py -B3 -A10

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

# Check the end of read_blocks_from_file to see full implementation
sed -n '390,430p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1463


🏁 Script executed:

# Check FileIO::read_blocks_from_file for comparison (non-GDS variant)
sed -n '190,230p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 1348


🏁 Script executed:

# Check what happens after _record_job is called - trace the control flow
grep -n '_record_job\|wait_job\|get_finished' kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 404


🏁 Script executed:

# Check if there's any explicit synchronization at the scheduler level
rg -n 'wait_job\|get_finished' kv_connectors/llmd_fs_backend/llmd_fs_backend/ -B5 -A5

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

# Check scheduler/high-level usage to see when buffers become accessible
rg -n 'get_finished\|wait_job' kv_connectors/llmd_fs_backend/llmd_fs_backend/ -B10 -A10

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

# Check if there's any CUDA synchronization between wait_job and actual GPU access
rg -n 'torch.cuda\|cuda_stream\|synchronize' kv_connectors/llmd_fs_backend/llmd_fs_backend/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 55


🏁 Script executed:

# Compare the two read implementations side by side - check for stream sync
rg -n 'cudaStreamSynchronize' kv_connectors/llmd_fs_backend/csrc/storage/backends/

Repository: opendatahub-io/llm-d-kv-cache

Length of output: 615


GdsFileIO read path ignores CUDA stream and lacks GPU synchronization, creating potential data race with concurrent GPU access.

GdsFileIO::read_blocks_from_file (line 340-346) discards the stream parameter with (void)stream; and performs cuFileRead directly without stream synchronization. This contrasts with the write path (async_store_gpu_blocks), which waits on gpu_kvs_ready_event before write operations start, and with FileIO::read_blocks_from_file, which calls cudaStreamSynchronize(stream) after read completes (line 217). Since cuFileRead is synchronous but not stream-aware, concurrent GPU kernels reading from the target buffers during or shortly after the read operation will access stale or partially-written data, risking corruption.

Add pre-read synchronization via cudaStreamWaitEvent (like the write path) or post-read cudaStreamSynchronize (like FileIO) to ensure GPU work on the buffers completes before the read begins.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`
around lines 340 - 346, GdsFileIO::read_blocks_from_file currently ignores the
provided cudaStream_t and performs cuFileRead without GPU synchronization;
update this function to synchronize GPU work on the target device buffers
before/after the cuFileRead (mirror the write path and FileIO behavior) by
calling cudaStreamWaitEvent(stream, gpu_kvs_ready_event) before issuing the read
or calling cudaStreamSynchronize(stream) after the read completes (or both if
needed) to prevent concurrent kernel access to partially-written data; reference
GdsFileIO::read_blocks_from_file, gpu_kvs_ready_event, cudaStreamWaitEvent,
cudaStreamSynchronize, async_store_gpu_blocks, and FileIO::read_blocks_from_file
when making the change.

Comment on lines +310 to +327
// Execute the read operation using polymorphic storage handler
try {
  // Stage 1: Read file to staging CPU tensor.
  // Read data from disk into a tensor.
  success = TIME_EXPR("read phase 1: read_buffer_from_file",
                      read_buffer_from_file(src_file, buf),
                      "file:",
                      src_file);
  size_t total_size =
      block_ids.size() * m_tensor_copier.get_block_size();
  success = TIME_EXPR_THROUGHPUT(
      "read: storage handler",
      m_read_handler->read_blocks_from_file(src_file,
                                            block_ids,
                                            tls_stream.stream()),
      total_size,
      "file:",
      src_file,
      " blocks:",
      block_ids.size());
  if (!success) {
    FS_LOG_ERROR("Stage1 read_buffer_from_file failed for "
                 << src_file);
    return success;
  }
  // Stage 2: copy tensors from staging CPU tensor to GPU.
  // Perform asynchronous GPU copy and tensor swap.
  auto* cpu_base = static_cast<uint8_t*>(buf.ptr);
  bool is_store = false;
  // Execute the copy operation
  success = TIME_EXPR(
      "read phase 2: copy_cpu_tensor_to_gpu_tensors",
      m_tensor_copier.copy_blocks(cpu_base, block_ids, is_store),
      "file: ",
      src_file);

  auto& tls_stream = ThreadPool::get_tls_stream();
  cudaError_t err = cudaStreamSynchronize(tls_stream.stream());
  if (err != cudaSuccess) {
    FS_LOG_ERROR(
        "cudaStreamSynchronize failed: " << cudaGetErrorString(err));
    return false;
    FS_LOG_ERROR("Load failed for " << src_file);
  }
  return success;

⚠️ Potential issue | 🟠 Major

Missing atime update after GDS read operations breaks LRU eviction.

The write path calls FileIO::update_atime() when a file already exists (line 231), but the read path never updates atime. When m_read_handler is a GdsFileIO instance, files read via GDS won't have their access times updated, breaking LRU cache eviction logic that depends on atime. As noted in relevant code snippet from storage_offload.cpp:313-321, no update_atime() call follows the read operation.

Add atime update after successful read
```diff
             success = TIME_EXPR_THROUGHPUT(
                 "read: storage handler",
                 m_read_handler->read_blocks_from_file(src_file,
                                                       block_ids,
                                                       tls_stream.stream()),
                 total_size,
                 "file:",
                 src_file,
                 " blocks:",
                 block_ids.size());
             if (!success) {
               FS_LOG_ERROR("Load failed for " << src_file);
+            } else {
+              FileIO::update_atime(src_file);
             }
             return success;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp` around lines
310 - 327, The read path in storage_offload.cpp does not update access time
after a successful GDS read, which breaks LRU eviction; after a successful
TIME_EXPR_THROUGHPUT call that sets success from
m_read_handler->read_blocks_from_file (the block read branch in the function
where src_file and block_ids are used), call the FileIO::update_atime for the
same src_file (or invoke the appropriate update_atime method on the file IO
implementation such as GdsFileIO) before returning; ensure the update is
performed only when success is true and handle/log any update failures similarly
to the write path's update_atime usage so atime is kept in sync for LRU
eviction.
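As a language-neutral illustration of the suggested fix, here is a hedged Python sketch of what an `update_atime` helper does: it rewrites only the access time so atime-based LRU ranking sees the file as recently used. The helper name mirrors `FileIO::update_atime`, but this is an illustrative stand-in, not the project's C++ code.

```python
import os
import tempfile
import time

def update_atime(path: str) -> None:
    """Refresh only the access time; leave mtime untouched (illustrative stand-in)."""
    st = os.stat(path)
    os.utime(path, (time.time(), st.st_mtime))

# Demo: create a file, age its atime by an hour, then refresh it after a "read".
fd, path = tempfile.mkstemp()
os.write(fd, b"kv-block")
os.close(fd)
os.utime(path, (time.time() - 3600, os.stat(path).st_mtime))
stale_atime = os.stat(path).st_atime
update_atime(path)
fresh_atime = os.stat(path).st_atime
os.unlink(path)
```

An explicit `utime` call always sets the timestamp, so this pattern also works on filesystems mounted with `relatime`, where a plain read may not bump atime at all.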

Comment on lines +1 to +12
```
{
    // Optimized cuFile configuration for RDMA-based storage (NVMe-oF, BeeGFS, Lustre, GPFS, Weka).
    //
    // REQUIRED: Set rdma_dev_addr_list to the IP addresses of your RDMA NICs.
    // To find them: `ibdev2netdev` or `ip addr show` (look for your RDMA interface IPs).
    //
    // Usage:
    //   export CUFILE_ENV_PATH_JSON=/path/to/cufile_rdma.json
    //
    // NOTE: Application can override custom configuration via export CUFILE_ENV_PATH_JSON=<filepath>
    // e.g : export CUFILE_ENV_PATH_JSON="/home/<xxx>/cufile_rdma.json"
```
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does NVIDIA cuFile configuration support JSON with comments?

💡 Result:

No, NVIDIA cuFile configuration does not support JSON with comments. Official NVIDIA documentation for cuFile shows the /etc/cufile.json configuration file using standard JSON syntax with inline comments in example code (e.g., result 1: "// specify list of file-systems"), but these appear as documentation artifacts rather than parsed comments. Standard JSON parsers reject comments, and no cuFile-specific documentation or source evidence indicates use of a JSONC (JSON with comments) parser. RAPIDS cuDF JSON reader explicitly lacks comment support (result 16), reinforcing standard JSON expectations in NVIDIA ecosystem tools. cuFile is a proprietary closed-source library; no public source confirms custom parsing.

Remove // comments—cuFile requires standard JSON and will fail to parse this configuration.

NVIDIA cuFile does not support JSON with comments. This template will generate parse errors when users attempt to use it. Either:

  1. Remove all // comments and keep as .json, or
  2. Provide cufile_rdma.json (valid JSON, no comments) alongside this file as .md documentation.

Users cannot directly use this as CUFILE_ENV_PATH_JSON without editing.

🧰 Tools
🪛 Biome (2.4.9)

[error] 2-2: Expected a property but instead found '// Optimized cuFile configuration for RDMA-based storage (NVMe-oF, BeeGFS, Lustre, GPFS, Weka).'.

(parse)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/docs/cufile_rdma.json` around lines 1 - 12, The
cufile_rdma.json template contains // comments which make it invalid JSON for
cuFile; update the file cufile_rdma.json so it is strict JSON with all comments
removed (leave only JSON key/value entries), or instead convert this file into a
.md (documentation) and add a separate valid cufile_rdma.json without comments;
ensure any references to CUFILE_ENV_PATH_JSON in the content remain unchanged
and the final JSON parses cleanly with no // comment lines.
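If one wanted to keep the commented template readable, a pre-processing step could strip the `//` comments before handing the file to cuFile. The sketch below is naive and illustrative: it assumes `//` never appears inside a JSON string value (which holds for this template), and the keys shown are placeholders, not a statement of the real cuFile schema.

```python
import json
import re

def strip_json_comments(text: str) -> str:
    # Naive: delete everything from "//" to end of line.
    # Assumes "//" never occurs inside a string value.
    return re.sub(r"//[^\n]*", "", text)

template = """{
    // comments like this make the file invalid strict JSON
    "execution": {
        "max_io_threads": 4  // illustrative key/value, not the full schema
    }
}"""

# After stripping, the template parses with any standard JSON parser.
config = json.loads(strip_json_comments(template))
```

The cleaner option from the review stands: ship a strict-JSON `cufile_rdma.json` and keep the commentary in a separate `.md` file.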


GDS uses the **cuFile** library by default for all GPU↔storage transfers. cuFile behavior (I/O threads,
bounce buffer sizes, RDMA settings, filesystem-specific tuning) can be customized via a `cufile_rdma.json`
config file — see [Tuning cuFile](#tuning-cufile-cufilejson) for an example.
⚠️ Potential issue | 🟡 Minor

Broken link fragment: heading anchor mismatch.

The link `#tuning-cufile-cufilejson` won't resolve to the heading at line 93 (`## Tuning cuFile (cufile_rdma.json)`). GitHub generates the anchor as `#tuning-cufile-cufile_rdmajson` (with the underscore from the filename).

Proposed fix
```diff
-bounce buffer sizes, RDMA settings, filesystem-specific tuning) can be customized via a `cufile_rdma.json`
-config file — see [Tuning cuFile](#tuning-cufile-cufilejson) for an example.
+bounce buffer sizes, RDMA settings, filesystem-specific tuning) can be customized via a `cufile_rdma.json`
+config file — see [Tuning cuFile](#tuning-cufile-cufile_rdmajson) for an example.
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-config file — see [Tuning cuFile](#tuning-cufile-cufilejson) for an example.
+config file — see [Tuning cuFile](#tuning-cufile-cufile_rdmajson) for an example.
```
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 8-8: Link fragments should be valid

(MD051, link-fragments)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/docs/gds.md` at line 8, The link fragment in
gds.md uses the wrong anchor `#tuning-cufile-cufilejson`; update it to match the
generated GitHub anchor for the heading `## Tuning cuFile (`cufile_rdma.json`)`
by changing the fragment to `#tuning-cufile-cufile_rdmajson` (or alternatively
rename the heading to remove the underscore to match the existing link). Locate
the incorrect fragment string in the link and replace it with the corrected
anchor `#tuning-cufile-cufile_rdmajson` (or adjust the heading text `Tuning
cuFile (cufile_rdma.json)`) so the link resolves.
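The anchor mismatch can be checked mechanically. Below is an approximate sketch of GitHub's heading-anchor rule (lowercase, keep word characters, spaces, and hyphens, then turn spaces into hyphens); it is a simplified model, not GitHub's exact implementation, and ignores the numeric suffixes GitHub appends for duplicate headings.

```python
import re

def github_anchor(heading: str) -> str:
    # Approximate GitHub slug rule: lowercase, strip punctuation other than
    # word chars/spaces/hyphens, then replace spaces with hyphens.
    text = heading.lower()
    text = re.sub(r"[^\w\- ]", "", text)
    return text.replace(" ", "-")

# The heading under discussion: the underscore survives, the dot and parens do not.
anchor = github_anchor("Tuning cuFile (cufile_rdma.json)")
# → "tuning-cufile-cufile_rdmajson"
```

This reproduces the review's point: the filename's underscore lands in the slug, so `#tuning-cufile-cufilejson` can never match.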

```python
        gpu_block_size: int,
        gpu_blocks_per_file: int,
        threads_per_gpu: int,
        gds_mode: str,
```
⚠️ Potential issue | 🟡 Minor

Missing default value for gds_mode parameter.

The constructor lacks a default value for gds_mode, making it a required argument. This is a breaking change for any existing callers. Consider adding gds_mode: str = "disabled" for backward compatibility, consistent with the C++ binding default.

Proposed fix
```diff
     def __init__(
         self,
         kv_caches: dict[str, torch.Tensor],
         attn_backends: dict[str, type[AttentionBackend]],
         file_mapper: FileMapper,
         gpu_block_size: int,
         gpu_blocks_per_file: int,
         threads_per_gpu: int,
-        gds_mode: str,
+        gds_mode: str = "disabled",
         max_staging_memory_gb: int = DEFAULT_MAX_STAGING_MEMORY_GB,
         read_preferring_ratio: float = DEFAULT_READ_PREFERRING_WORKERS_RATIO,
     ):
```
📝 Committable suggestion

Suggested change
```diff
-        gds_mode: str,
+        gds_mode: str = "disabled",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py` at line 253, The
constructor for the worker class (the __init__ method in
llmd_fs_backend/worker.py) makes gds_mode a required parameter which breaks
callers; change the signature to give gds_mode a default of "disabled" (e.g.,
gds_mode: str = "disabled") so callers are backward compatible and consistent
with the C++ binding default, and update any internal references or docstrings
that assume a provided value if necessary.
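A minimal sketch of the compatibility point, using a hypothetical `Worker` class rather than the project's actual signature: giving the new parameter a default means call sites written before the parameter existed keep constructing the object unchanged.

```python
class Worker:
    """Hypothetical stand-in for the connector worker's __init__."""

    def __init__(self, threads_per_gpu: int, gds_mode: str = "disabled"):
        self.threads_per_gpu = threads_per_gpu
        self.gds_mode = gds_mode

# A pre-existing call site that never heard of gds_mode still works,
# and GDS stays off by default — consistent with the C++ binding default.
old_caller = Worker(threads_per_gpu=4)

# New callers opt in explicitly.
new_caller = Worker(threads_per_gpu=4, gds_mode="read_write")
```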
