[sync] upstream llm-d main branch 1c435cb [2026-04-02] #16

zdtsw wants to merge 6 commits into opendatahub-io:main_2 from …
Conversation
Multimodal requests need to download and process images, which can exceed the 5s default timeout in CI. Use 30s for requests with structured content parts.
* feat: add GPUDirect Storage (GDS) support to llmd_fs_backend

  Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
Granite's built-in chat template injects the current date via strftime_now(), causing goldenChatTokenIDs to change daily. Add an explicit system message to the golden chat conversation so the template uses the provided content instead of generating a date-dependent default system prompt.
* fix: single-pass []any decode for forward/backward compat with vLLM event schema (llm-d#484)

  vLLM uses msgspec with array_like=True and omit_defaults=True, producing positional msgpack arrays where trailing fields may be absent. The previous typed-struct decode broke when vLLM appended new fields (old consumer fails) or when a newer consumer read from an older vLLM (shorter array than expected). Replace double-decode ([]any for tag + typed struct) with a single unmarshal into []any and positional extraction with length guards. Extra trailing fields from newer vLLM are silently ignored; missing trailing fields from older vLLM get zero values.
* test: add decode benchmarks for vLLM event schema compat
* fix: address review — reuse shared helpers, fix lint
  - Reuse convertBlockHashes() from common.go in both vLLM converters
  - Move convertExtraKeys() back to common.go (shared with SGLang)
  - Remove unused engineName param from decodeEvent()
  - Fix bench test lint: error checks, paramTypeCombine, appendCombine
* fix: suppress gosec G115 in bench test data construction
📝 Walkthrough

This PR introduces GPUDirect Storage (GDS) support to enable direct GPU-to-file I/O via NVIDIA cuFile. Changes include: (1) dynamic runtime loading of …

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Security & Design Findings: CWE-427 (Untrusted Search Path), CWE-252 (Unchecked Return Value), CWE-367 (Time-of-Check-Time-of-Use), a design issue, and an architectural concern.

🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp (1)
131-142: ⚠️ Potential issue | 🟠 Major

`gds_mode` parameter is unused; staging buffer allocated even for full-GDS modes.

The function signature accepts `GdsMode gds_mode` but never references it. Per the header comment ("0 for full-GDS modes"), this should return 0 when `gds_mode` indicates direct GDS transfers, avoiding unnecessary memory allocation.

Proposed fix
```diff
 size_t StorageOffloadEngine::calc_staging_bytes(
     int gpu_blocks_per_file,
     const std::vector<torch::Tensor>& tensors,
     GdsMode gds_mode) {
+  // Full-GDS modes don't need CPU staging buffers
+  if (gds_mode == GdsMode::READ_WRITE) {
+    return 0;
+  }
   size_t block_size_in_bytes = 0;
   for (const auto& tensor : tensors) {
     block_size_in_bytes += static_cast<size_t>(tensor.stride(0)) *
                            static_cast<size_t>(tensor.element_size());
   }
   return block_size_in_bytes * static_cast<size_t>(gpu_blocks_per_file);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp` around lines 131 - 142, The calc_staging_bytes function currently ignores the gds_mode parameter and always computes a staging buffer size; update StorageOffloadEngine::calc_staging_bytes to return 0 when gds_mode indicates full-GDS/direct GDS transfer mode (per the header note "0 for full-GDS modes") and only compute the byte-size loop for non-GDS modes; locate the function by name and add a conditional branch that checks the appropriate GdsMode enum/flag for full-GDS before performing the tensor stride*element_size accumulation so staging memory is not allocated for direct-GDS transfers.
🧹 Nitpick comments (12)
pkg/tokenization/uds_tokenizer.go (2)
76-78: Reuse `mmTimeout` in `warmup()` for consistency.

Line 200 hardcodes `30*time.Second` for the warmup context timeout. Consider using `mmTimeout` to keep the value in sync.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tokenization/uds_tokenizer.go` around lines 76 - 78, The warmup() function currently hardcodes a 30*time.Second context timeout; replace that literal with the mmTimeout constant to keep timeout values consistent. Locate the context creation in warmup() (e.g., context.WithTimeout(..., 30*time.Second)) and change the timeout argument to mmTimeout so the function uses the shared mmTimeout variable defined near the top of the file.
284-291: Parent context deadline ignored; gRPC call may outlive caller's timeout.

`context.Background()` is hardcoded, so if upstream callers (e.g., HTTP handlers, orchestrators) have their own deadlines, this 30s timeout will not respect them. A slow backend could hold resources long after the original request is abandoned.

Consider accepting a parent `context.Context` parameter and deriving the timeout from it: `ctx, cancel := context.WithTimeout(parentCtx, timeout)`

This requires a signature change to `RenderChat` (and the `Tokenizer` interface), which may be out of scope for this PR. If so, document this limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tokenization/uds_tokenizer.go` around lines 284 - 291, The RenderChat call is creating a new context from context.Background(), ignoring upstream deadlines; change the API to accept a parent context.Context and derive the timeout from it (use ctx, cancel := context.WithTimeout(parentCtx, timeout)), update the Tokenizer interface signature (Tokenizer.RenderChat) and all callers to pass through the caller's context, and ensure cancel() is deferred; if changing signatures is out of scope, add a clear code comment on RenderChat and the Tokenizer interface describing this limitation and the potential resource leak so it is documented for future work.

pkg/kvevents/engineadapter/vllm_adapter.go (1)
293-316: Missing `int` case in type switch.

`toInt` handles `int8`-`int64` and `uint8`-`uint64` but not the base `int` type. While msgpack v5 typically decodes integers as `int64`/`uint64`, if any code path passes a Go `int`, this will fail with a confusing "unsupported numeric type: int" error.

Add int case for completeness

```diff
 func toInt(raw any) (int, error) {
 	switch v := raw.(type) {
+	case int:
+		return v, nil
 	case int64:
 		return int(v), nil
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/kvevents/engineadapter/vllm_adapter.go` around lines 293 - 316, The toInt function is missing a case for the native Go int type which causes "unsupported numeric type: int" errors; update the type switch in toInt (function toInt) to include a case for int that returns int(v), nil (you can add a comment similar to the uint64 case about overflow not being a concern) so the function handles int values alongside int8/int16/int32/int64 and the unsigned variants.

kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp (3)
144-149: `utimensat` return value ignored—silent failure on atime update.

If `utimensat` fails (e.g., permissions, non-existent file), the error is silently discarded. Consider logging at DEBUG level for troubleshooting.

Proposed fix

```diff
 void FileIO::update_atime(const std::string& path) {
   struct timespec times[2];
   times[0].tv_nsec = UTIME_NOW;   // atime → now
   times[1].tv_nsec = UTIME_OMIT;  // mtime → unchanged
-  utimensat(AT_FDCWD, path.c_str(), times, 0);
+  if (utimensat(AT_FDCWD, path.c_str(), times, 0) != 0) {
+    FS_LOG_DEBUG("Failed to update atime for: " << path << " - "
+                 << std::strerror(errno));
+  }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp` around lines 144 - 149, FileIO::update_atime currently calls utimensat and ignores its return value; change it to check the return value and, on error (return < 0), log a DEBUG-level message including the path and strerror(errno) (or errno) to aid troubleshooting; ensure you include context (function name/FileIO::update_atime and the path) in the log and keep behavior otherwise unchanged (no exception thrown) so callers remain unaffected.
66-74: `pubsetbuf` after opening has implementation-defined behavior.

Per the C++ standard, calling `pubsetbuf` after a file is already open may be ignored by some implementations. Move the buffer setup before `open()` or use the two-argument `std::ofstream` constructor after setting the buffer.

Proposed fix

```diff
-  std::ofstream ofs(tmp_path, std::ios::out | std::ios::binary);
-  if (!ofs) {
+  std::ofstream ofs;
+  ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE);
+  ofs.open(tmp_path, std::ios::out | std::ios::binary);
+  if (!ofs.is_open()) {
     FS_LOG_ERROR("Failed to open temporary file for writing: " << tmp_path
                  << " - " << std::strerror(errno));
     return false;
   }
-
-  // Apply the custom buffer to the file stream
-  ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp` around lines 66 - 74, The call to ofs.rdbuf()->pubsetbuf(...) occurs after the std::ofstream is opened which is implementation-defined and may be ignored; change to set the buffer on the streambuf before opening the file (use a default-constructed std::ofstream ofs, call ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE), then call ofs.open(tmp_path, std::ios::out | std::ios::binary) and check the stream), or alternatively use a constructor/approach that ensures the buffer is installed prior to opening; update the code referencing tmp_path, ofs, thread_write_buffer, and WRITE_BUFFER_SIZE accordingly.
44-45: Potential tmp file collision across processes using same storage path.

`std::random_device{}()` returns a 32-bit value. While thread-local ensures per-thread uniqueness, multiple processes could generate identical suffixes. Consider including the PID:

Proposed fix

```diff
+#include <unistd.h>
+
 thread_local std::string tmp_file_suffix =
-    "_" + std::to_string(std::random_device{}()) + ".tmp";
+    "_" + std::to_string(getpid()) + "_" +
+    std::to_string(std::random_device{}()) + ".tmp";
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp` around lines 44 - 45, The thread-local tmp_file_suffix in file_io.cpp uses std::random_device{}() which can collide across processes; modify construction of tmp_file_suffix (the thread_local std::string) to include the current process identifier plus a higher-entropy random value so suffixes are unique across processes (e.g., concatenate "_" + process id + "_" + a 64-bit RNG value + ".tmp"). Use appropriate API for PID on each platform (getpid() on POSIX or GetCurrentProcessId() on Windows) or fall back to std::hash<std::thread::id>()/timestamp as needed, and seed a std::mt19937_64 with std::random_device for the random portion. Ensure the change preserves thread_local semantics and the variable name tmp_file_suffix remains the same.

kv_connectors/llmd_fs_backend/docs/cufile_rdma.json (1)
69-69: Placeholder values must be replaced before use.

`rdma_dev_addr_list` contains `<RDMA_NIC_IP_1>` placeholders. This will cause cuFile parse failures if not replaced. The instructions at lines 4-5 are clear, but consider adding a startup validation in the GDS initialization code to detect and warn about placeholder values.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv_connectors/llmd_fs_backend/docs/cufile_rdma.json` at line 69, The rdma_dev_addr_list JSON contains placeholder entries like "<RDMA_NIC_IP_1>" which will break cuFile parsing; add a startup validation in the GDS initialization path (e.g., the function that performs GDS initialization/registration such as initGDS/initializeGDS or the routine that loads cuFile/GDS config) that inspects the rdma_dev_addr_list array and fails early or logs a clear error if any entry matches the placeholder pattern (e.g., contains '<' or matches /<.*>/ or known tokens like "<RDMA_NIC_IP_"), instructing the user to replace them; make this validation produce a fatal error or warning with actionable text so misconfigured JSON is detected before cuFile is invoked.

kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp (1)
54-57: Docstring formatting issue: missing newline before `gds_mode` description.

Lines 55-56 run `read_ratio)` directly into `gds_mode:` without a separating newline.

Proposed fix

```diff
   "    read_preferring_workers: Number of workers that check "
-  "    read queue first (calculated as int(io_threads * read_ratio) "
-  "    gds_mode: GDS operation mode (see GdsMode in storage_types.hpp). "
+  "    read queue first (calculated as int(io_threads * read_ratio)).\n"
+  "    gds_mode: GDS operation mode (see GdsMode in storage_types.hpp). "
   "Defaults to 'disabled'.\n")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp` around lines 54 - 57, The docstring in storage_offload_bindings.cpp concatenates "read_ratio)" and "gds_mode:" without a newline; update the string literal so there is a newline (e.g., "\n") before the "gds_mode: GDS operation mode..." description to separate the paragraphs referring to read_preferring_workers/read_ratio and gds_mode; locate the multi-line docstring containing "read_preferring_workers" and "gds_mode" and insert the missing newline in that literal.

kv_connectors/llmd_fs_backend/docs/gds.md (1)
76-78: Fenced code blocks should specify a language.

Add `text` or `log` as the language identifier for example output blocks to satisfy markdown linting and improve rendering.

Proposed fix

````diff
-```
+```text
 GdsFileIO: GPUDirect Storage (GDS) enabled
 ```
````

If GDS initialization fails, the connector automatically falls back to CPU staging and logs:

````diff
-```
+```text
 StorageOffloadEngine: GDS initialization failed, falling back to CPU_BUFFER_STAGE for both READ and WRITE
 StorageOffloadEngine: READ=CPU WRITE=CPU
 ```
````

Also applies to: 82-85
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv_connectors/llmd_fs_backend/docs/gds.md` around lines 76 - 78, The markdown example output blocks for GdsFileIO and the StorageOffloadEngine logs are missing a fenced-code language; update the two fenced blocks containing "GdsFileIO: GPUDirect Storage (GDS) enabled" and the block with "StorageOffloadEngine: GDS initialization failed..." / "StorageOffloadEngine: READ=CPU WRITE=CPU" to use a language identifier (e.g., add ```text or ```log) so the code fences include a language for proper linting and rendering; locate the blocks by the unique text snippets "GdsFileIO: GPUDirect Storage (GDS) enabled" and "StorageOffloadEngine: GDS initialization failed, falling back to CPU_BUFFER_STAGE" when making the change.

kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp (1)
104-110: Silent failure when `dlopen` fails; no diagnostic output.

When `libcufile.so` cannot be loaded, the code returns silently with `m_handle = nullptr`. Adding `dlerror()` output would help diagnose missing library paths or dependency issues in production.

Log dlopen failure reason

```diff
 CuFileApi() {
   m_handle = dlopen("libcufile.so", RTLD_NOW);
   if (!m_handle) {
     // Try versioned name
     m_handle = dlopen("libcufile.so.0", RTLD_NOW);
   }
-  if (!m_handle) return;
+  if (!m_handle) {
+    // Optionally log: dlerror() contains the failure reason
+    return;
+  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp` around lines 104 - 110, The CuFileApi constructor currently returns silently when dlopen fails (m_handle == nullptr); update CuFileApi to call dlerror() and log or print the returned error string when both dlopen("libcufile.so", RTLD_NOW) and dlopen("libcufile.so.0", RTLD_NOW) fail so you get diagnostic output about why the library couldn't be loaded; locate the CuFileApi constructor and add a diagnostic message referencing dlerror() (and the attempted names "libcufile.so" / "libcufile.so.0") before returning so failures are visible at runtime.

kv_connectors/llmd_fs_backend/tests/test_gds_backend.py (2)
30-34: Exception handling swallows all errors silently.

Catching bare `Exception` and returning `False` loses diagnostic information. If `lsmod` or `ldconfig` fails for unexpected reasons (e.g., permissions), the failure reason is hidden.

Log the exception for debugging

```diff
 try:
     result = subprocess.run(["lsmod"], capture_output=True, text=True, check=False)
     nvidia_fs_loaded = "nvidia_fs" in result.stdout
-except Exception:
+except Exception as e:
+    # Log for debugging; proceed without nvidia_fs detection
     nvidia_fs_loaded = False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv_connectors/llmd_fs_backend/tests/test_gds_backend.py` around lines 30 - 34, The test currently swallows all errors in the try/except around the subprocess.run call that sets nvidia_fs_loaded, making failures silent; update the except block in the same try that invokes subprocess.run(["lsmod"]) to log the caught exception (include the exception details/traceback) before setting nvidia_fs_loaded = False — use the project logger or Python's logging with logger.exception() or logging.error(..., exc_info=True) so the failure reason is recorded for debugging while preserving the fallback behavior.
60-73: Minor inefficiency: duplicate status collection.
`check_gds_available()` and `get_gds_status_message()` both call `_collect_gds_status()`. In `test_gds_available()` (lines 81-82), both are invoked, resulting in redundant subprocess calls. Consider returning status from `check_gds_available()` or caching the result.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv_connectors/llmd_fs_backend/tests/test_gds_backend.py` around lines 60 - 73, The two functions call _collect_gds_status() twice causing redundant work; change check_gds_available() to return the collected status dict (or store it in a module-level cached variable) instead of just a bool, then update get_gds_status_message() and test_gds_available() to reuse that returned status (or read the cached value) so _collect_gds_status() is invoked only once; reference the functions check_gds_available, get_gds_status_message, and _collect_gds_status when making the changes and adjust test_gds_available() to accept the new return shape or use the cache.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp`:
- Around line 40-58: The CUfileDrvProps_t definition here mismatches the
official NVIDIA cuFile ABI and can corrupt memory when
cuFileDriverGetProperties() populates it; replace the local struct with the
exact definition from NVIDIA's cuFile headers (or include the vendor header
directly) so fields like the nested nvfs struct, CUfileFeatureFlags_t fflags,
unsigned int-typed fields (e.g., max_pinned_memory_size, per_buffer_cache_size),
and any optional max_batch_io_timeout_msecs are present and correctly typed and
ordered; keep CUfileDescr_t as-is but remove or replace the _reserved[256]
placeholder and ensure CUfileDrvProps_t matches the vendor header exactly.
In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`:
- Around line 215-228: The loop that registers GPU buffer chunks (using
num_chunks, chunk_size, size, block_ptr and calling m_cufile.cuFileBufRegister)
can overread when size isn't a multiple of chunk_size; change the registration
to compute remaining = size - (i * chunk_size) and set register_len =
std::min(chunk_size, remaining), call cuFileBufRegister with register_len (not
chunk_size) and store register_len in m_registered_buffers[block_ptr] so the
last block is clamped to the actual remaining bytes.
- Around line 32-34: The thread-local temporary suffix gds_tmp_suffix is
vulnerable to collisions because it uses only std::random_device{}() (32-bit);
change its generation to include the calling thread's identity and stronger
uniqueness (for example combine std::this_thread::get_id() and a per-thread or
global atomic counter or a larger random value) so each thread produces a
distinct suffix; update the thread_local gds_tmp_suffix initialization (the
symbol gds_tmp_suffix in gds_file_io.cpp) to concatenate thread id and an added
unique token to avoid temp file collisions when writing the same file_path.
- Around line 259-266: The current use of O_DIRECT when opening the temp file
(open(..., O_DIRECT)) can cause EINVAL in cuFileRead/cuFileWrite if
tensor.data_ptr() + (gpu_block_idx * block_size) or block_size is not aligned to
filesystem block boundaries; update the Open+IO path in GdsFileIO so it first
verifies alignment (ensure block_size and computed actual_gpu_ptr are aligned to
512/4096 as required) and if not aligned or if open with O_DIRECT or the cuFile
calls return EINVAL, retry the operation by reopening the file without O_DIRECT
and performing standard buffered I/O as a fallback; add explicit checks around
the arithmetic that computes actual_gpu_ptr (referencing tensor.data_ptr() and
gpu_block_idx) and improve error logging in the cuFileWrite/cuFileRead error
handlers to distinguish alignment errors vs other errno values so the code can
decide to retry without O_DIRECT.
- Around line 340-346: GdsFileIO::read_blocks_from_file currently ignores the
provided cudaStream_t and performs cuFileRead without GPU synchronization;
update this function to synchronize GPU work on the target device buffers
before/after the cuFileRead (mirror the write path and FileIO behavior) by
calling cudaStreamWaitEvent(stream, gpu_kvs_ready_event) before issuing the read
or calling cudaStreamSynchronize(stream) after the read completes (or both if
needed) to prevent concurrent kernel access to partially-written data; reference
GdsFileIO::read_blocks_from_file, gpu_kvs_ready_event, cudaStreamWaitEvent,
cudaStreamSynchronize, async_store_gpu_blocks, and FileIO::read_blocks_from_file
when making the change.
In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp`:
- Around line 310-327: The read path in storage_offload.cpp does not update
access time after a successful GDS read, which breaks LRU eviction; after a
successful TIME_EXPR_THROUGHPUT call that sets success from
m_read_handler->read_blocks_from_file (the block read branch in the function
where src_file and block_ids are used), call the FileIO::update_atime for the
same src_file (or invoke the appropriate update_atime method on the file IO
implementation such as GdsFileIO) before returning; ensure the update is
performed only when success is true and handle/log any update failures similarly
to the write path's update_atime usage so atime is kept in sync for LRU
eviction.
In `@kv_connectors/llmd_fs_backend/docs/cufile_rdma.json`:
- Around line 1-12: The cufile_rdma.json template contains // comments which
make it invalid JSON for cuFile; update the file cufile_rdma.json so it is
strict JSON with all comments removed (leave only JSON key/value entries), or
instead convert this file into a .md (documentation) and add a separate valid
cufile_rdma.json without comments; ensure any references to CUFILE_ENV_PATH_JSON
in the content remain unchanged and the final JSON parses cleanly with no //
comment lines.
In `@kv_connectors/llmd_fs_backend/docs/gds.md`:
- Line 8: The link fragment in gds.md uses the wrong anchor
`#tuning-cufile-cufilejson`; update it to match the generated GitHub anchor for
the heading `## Tuning cuFile (`cufile_rdma.json`)` by changing the fragment to
`#tuning-cufile-cufile_rdmajson` (or alternatively rename the heading to remove
the underscore to match the existing link). Locate the incorrect fragment string
in the link and replace it with the corrected anchor
`#tuning-cufile-cufile_rdmajson` (or adjust the heading text `Tuning cuFile
(cufile_rdma.json)`) so the link resolves.
In `@kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py`:
- Line 253: The constructor for the worker class (the __init__ method in
llmd_fs_backend/worker.py) makes gds_mode a required parameter which breaks
callers; change the signature to give gds_mode a default of "disabled" (e.g.,
gds_mode: str = "disabled") so callers are backward compatible and consistent
with the C++ binding default, and update any internal references or docstrings
that assume a provided value if necessary.
---
Outside diff comments:
In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp`:
- Around line 131-142: The calc_staging_bytes function currently ignores the
gds_mode parameter and always computes a staging buffer size; update
StorageOffloadEngine::calc_staging_bytes to return 0 when gds_mode indicates
full-GDS/direct GDS transfer mode (per the header note "0 for full-GDS modes")
and only compute the byte-size loop for non-GDS modes; locate the function by
name and add a conditional branch that checks the appropriate GdsMode enum/flag
for full-GDS before performing the tensor stride*element_size accumulation so
staging memory is not allocated for direct-GDS transfers.
---
Nitpick comments:
In
`@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp`:
- Around line 104-110: The CuFileApi constructor currently returns silently when
dlopen fails (m_handle == nullptr); update CuFileApi to call dlerror() and log
or print the returned error string when both dlopen("libcufile.so", RTLD_NOW)
and dlopen("libcufile.so.0", RTLD_NOW) fail so you get diagnostic output about
why the library couldn't be loaded; locate the CuFileApi constructor and add a
diagnostic message referencing dlerror() (and the attempted names "libcufile.so"
/ "libcufile.so.0") before returning so failures are visible at runtime.
In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp`:
- Around line 144-149: FileIO::update_atime currently calls utimensat and
ignores its return value; change it to check the return value and, on error
(return < 0), log a DEBUG-level message including the path and strerror(errno)
(or errno) to aid troubleshooting; ensure you include context (function
name/FileIO::update_atime and the path) in the log and keep behavior otherwise
unchanged (no exception thrown) so callers remain unaffected.
- Around line 66-74: The call to ofs.rdbuf()->pubsetbuf(...) occurs after the
std::ofstream is opened which is implementation-defined and may be ignored;
change to set the buffer on the streambuf before opening the file (use a
default-constructed std::ofstream ofs, call
ofs.rdbuf()->pubsetbuf(thread_write_buffer.data(), WRITE_BUFFER_SIZE), then call
ofs.open(tmp_path, std::ios::out | std::ios::binary) and check the stream), or
alternatively use a constructor/approach that ensures the buffer is installed
prior to opening; update the code referencing tmp_path, ofs,
thread_write_buffer, and WRITE_BUFFER_SIZE accordingly.
- Around line 44-45: The thread-local tmp_file_suffix in file_io.cpp uses
std::random_device{}() which can collide across processes; modify construction
of tmp_file_suffix (the thread_local std::string) to include the current process
identifier plus a higher-entropy random value so suffixes are unique across
processes (e.g., concatenate "_" + process id + "_" + a 64-bit RNG value +
".tmp"). Use appropriate API for PID on each platform (getpid() on POSIX or
GetCurrentProcessId() on Windows) or fall back to
std::hash<std::thread::id>()/timestamp as needed, and seed a std::mt19937_64
with std::random_device for the random portion. Ensure the change preserves
thread_local semantics and the variable name tmp_file_suffix remains the same.
In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp`:
- Around line 54-57: The docstring in storage_offload_bindings.cpp concatenates
"read_ratio)" and "gds_mode:" without a newline; update the string literal so
there is a newline (e.g., "\n") before the "gds_mode: GDS operation mode..."
description to separate the paragraphs referring to
read_preferring_workers/read_ratio and gds_mode; locate the multi-line docstring
containing "read_preferring_workers" and "gds_mode" and insert the missing
newline in that literal.
In `@kv_connectors/llmd_fs_backend/docs/cufile_rdma.json`:
- Line 69: The rdma_dev_addr_list JSON contains placeholder entries like
"<RDMA_NIC_IP_1>" which will break cuFile parsing; add a startup validation in
the GDS initialization path (e.g., the function that performs GDS
initialization/registration such as initGDS/initializeGDS or the routine that
loads cuFile/GDS config) that inspects the rdma_dev_addr_list array and fails
early or logs a clear error if any entry matches the placeholder pattern (e.g.,
contains '<' or matches /<.*>/ or known tokens like "<RDMA_NIC_IP_"),
instructing the user to replace them; make this validation produce a fatal error
or warning with actionable text so misconfigured JSON is detected before cuFile
is invoked.
In `@kv_connectors/llmd_fs_backend/docs/gds.md`:
- Around line 76-78: The markdown example output blocks for GdsFileIO and the
StorageOffloadEngine logs are missing a fenced-code language; update the two
fenced blocks containing "GdsFileIO: GPUDirect Storage (GDS) enabled" and the
block with "StorageOffloadEngine: GDS initialization failed..." /
"StorageOffloadEngine: READ=CPU WRITE=CPU" to use a language identifier (e.g.,
add ```text or ```log) so the code fences include a language for proper linting
and rendering; locate the blocks by the unique text snippets "GdsFileIO:
GPUDirect Storage (GDS) enabled" and "StorageOffloadEngine: GDS initialization
failed, falling back to CPU_BUFFER_STAGE" when making the change.
In `@kv_connectors/llmd_fs_backend/tests/test_gds_backend.py`:
- Around line 30-34: The test currently swallows all errors in the try/except
around the subprocess.run call that sets nvidia_fs_loaded, making failures
silent; update the except block in the same try that invokes
subprocess.run(["lsmod"]) to log the caught exception (include the exception
details/traceback) before setting nvidia_fs_loaded = False — use the project
logger or Python's logging with logger.exception() or logging.error(...,
exc_info=True) so the failure reason is recorded for debugging while preserving
the fallback behavior.
- Around line 60-73: The two functions call _collect_gds_status() twice causing
redundant work; change check_gds_available() to return the collected status dict
(or store it in a module-level cached variable) instead of just a bool, then
update get_gds_status_message() and test_gds_available() to reuse that returned
status (or read the cached value) so _collect_gds_status() is invoked only once;
reference the functions check_gds_available, get_gds_status_message, and
_collect_gds_status when making the changes and adjust test_gds_available() to
accept the new return shape or use the cache.
In `@pkg/kvevents/engineadapter/vllm_adapter.go`:
- Around line 293-316: The toInt function is missing a case for the native Go
int type which causes "unsupported numeric type: int" errors; update the type
switch in toInt (function toInt) to include a case for int that returns int(v),
nil (you can add a comment similar to the uint64 case about overflow not being a
concern) so the function handles int values alongside int8/int16/int32/int64 and
the unsigned variants.
In `@pkg/tokenization/uds_tokenizer.go`:
- Around line 76-78: The warmup() function currently hardcodes a 30*time.Second
context timeout; replace that literal with the mmTimeout constant to keep
timeout values consistent. Locate the context creation in warmup() (e.g.,
context.WithTimeout(..., 30*time.Second)) and change the timeout argument to
mmTimeout so the function uses the shared mmTimeout variable defined near the
top of the file.
- Around line 284-291: The RenderChat call is creating a new context from
context.Background(), ignoring upstream deadlines; change the API to accept a
parent context.Context and derive the timeout from it (use ctx, cancel :=
context.WithTimeout(parentCtx, timeout)), update the Tokenizer interface
signature (Tokenizer.RenderChat) and all callers to pass through the caller's
context, and ensure cancel() is deferred; if changing signatures is out of
scope, add a clear code comment on RenderChat and the Tokenizer interface
describing this limitation and the potential resource leak so it is documented
for future work.

ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 3b385448-d5b4-451c-88f4-383658914f72
📒 Files selected for processing (33)
- .gitignore
- kv_connectors/llmd_fs_backend/Makefile
- kv_connectors/llmd_fs_backend/README.md
- kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp
- kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp
- kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.hpp
- kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp
- kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.hpp
- kv_connectors/llmd_fs_backend/csrc/storage/file_io.hpp
- kv_connectors/llmd_fs_backend/csrc/storage/logger.hpp
- kv_connectors/llmd_fs_backend/csrc/storage/storage_handler.hpp
- kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp
- kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.hpp
- kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp
- kv_connectors/llmd_fs_backend/csrc/storage/storage_types.hpp
- kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.hpp
- kv_connectors/llmd_fs_backend/csrc/storage/thread_pool.cpp
- kv_connectors/llmd_fs_backend/docs/cufile_rdma.json
- kv_connectors/llmd_fs_backend/docs/gds.md
- kv_connectors/llmd_fs_backend/llmd_fs_backend/spec.py
- kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py
- kv_connectors/llmd_fs_backend/setup.py
- kv_connectors/llmd_fs_backend/tests/conftest.py
- kv_connectors/llmd_fs_backend/tests/test_fs_backend.py
- kv_connectors/llmd_fs_backend/tests/test_gds_backend.py
- pkg/kvevents/engineadapter/common.go
- pkg/kvevents/engineadapter/sglang_adapter.go
- pkg/kvevents/engineadapter/sglang_adapter_test.go
- pkg/kvevents/engineadapter/vllm_adapter.go
- pkg/kvevents/engineadapter/vllm_adapter_bench_test.go
- pkg/kvevents/engineadapter/vllm_adapter_test.go
- pkg/tokenization/uds_tokenizer.go
- tests/e2e/uds_tokenizer/uds_e2e_test.go
💤 Files with no reviewable changes (1)
- kv_connectors/llmd_fs_backend/csrc/storage/file_io.hpp
```cpp
// File descriptor passed to cuFileHandleRegister (layout must match cufile.h)
struct CUfileDescr_t {
    CUfileFileHandleType type;
    union {
        int fd;
        void* handle;
    } handle;
    const CUfileFSOps_t* fs_ops;
};

// Opaque handle returned by cuFileHandleRegister
typedef void* CUfileHandle_t;

// Driver properties returned by cuFileDriverGetProperties
struct CUfileDrvProps_t {
    size_t max_device_cache_size;
    size_t max_device_pinned_mem_size;
    char _reserved[256];
};
```
🧩 Analysis chain
🌐 Web query:
What is the struct layout for CUfileDescr_t and CUfileDrvProps_t in NVIDIA cuFile library version 1.x?
💡 Result:
The struct layouts for CUfileDescr_t and CUfileDrvProps_t in NVIDIA cuFile library version 1.x, as documented in official NVIDIA GPUDirect Storage cuFile API references (across CUDA 11.4-11.8 and current GDS v1.16 docs), are as follows:

```c
typedef struct CUfileDescr_t {
    CUfileFileHandleType type;    /* type of file being registered */
    union {
        int fd;                   /* Linux */
        void *handle;             /* Windows */
    } handle;
    const CUfileFSOps_t *fs_ops;  /* file system operation table */
} CUfileDescr_t;

typedef struct CUfileDrvProps {
    struct {
        unsigned int major_version;
        unsigned int minor_version;
        size_t poll_thresh_size;
        size_t max_direct_io_size;
        unsigned int dstatusflags;
        unsigned int dcontrolflags;
    } nvfs;
    CUfileFeatureFlags_t fflags;
    unsigned int max_device_cache_size;
    unsigned int per_buffer_cache_size;
    unsigned int max_pinned_memory_size;
    unsigned int max_batch_io_timeout_msecs; /* present in some versions, e.g. CUDA 11.6+ */
} CUfileDrvProps_t;
```

Note: Minor variations exist across exact versions (e.g., max_batch_io_timeout_msecs appears in CUDA 11.6+ docs), but the core layout is consistent. The structure is self-describing per cuFileDriverGetProperties documentation. Layouts match across PDF docs, HTML API guides, and manpages for libcufile-dev 1.5+.
Citations:
- 1: https://docs.nvidia.com/cuda/archive/11.6.1/pdf/cuFile_API.pdf
- 2: https://docs.nvidia.com/gpudirect-storage/api-reference-guide/index.html
- 3: https://docs.nvidia.com/cuda/archive/11.4.1/pdf/cuFile_API.pdf
- 4: https://docs.nvidia.com/cuda/archive/11.8.0/cufile-api/index.html
- 5: https://docs.nvidia.com/cuda/archive/11.7.1/cufile-api/index.html
- 6: https://docs.nvidia.com/gpudirect-storage/api-reference-guide
Critical ABI mismatch in CUfileDrvProps_t struct definition—will cause memory corruption.
The CUfileDrvProps_t struct defined in this file does not match NVIDIA's official cuFile API definition. According to NVIDIA's cuFile documentation (CUDA 11.4–11.8, GDS v1.16), the actual struct contains:
- A nested `nvfs` struct with version and configuration fields
- A `CUfileFeatureFlags_t fflags` field
- `unsigned int` field types (not `size_t`)
- Fields named `max_pinned_memory_size` and `per_buffer_cache_size` (not `max_device_pinned_mem_size`)
- An optional `max_batch_io_timeout_msecs` in newer versions
The code's simplified definition with _reserved[256] padding will misalign with the actual library structure, causing memory corruption when cuFileDriverGetProperties() writes to it. Fix struct definition to match NVIDIA's headers exactly, or use the official NVIDIA-provided header directly instead of recreating it. CUfileDescr_t is correctly defined.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/cufile_loader.hpp`
around lines 40 - 58, The CUfileDrvProps_t definition here mismatches the
official NVIDIA cuFile ABI and can corrupt memory when
cuFileDriverGetProperties() populates it; replace the local struct with the
exact definition from NVIDIA's cuFile headers (or include the vendor header
directly) so fields like the nested nvfs struct, CUfileFeatureFlags_t fflags,
unsigned int-typed fields (e.g., max_pinned_memory_size, per_buffer_cache_size),
and any optional max_batch_io_timeout_msecs are present and correctly typed and
ordered; keep CUfileDescr_t as-is but remove or replace the _reserved[256]
placeholder and ensure CUfileDrvProps_t matches the vendor header exactly.
```cpp
// Thread-local unique suffix for temporary files (matches file_io.cpp pattern)
thread_local std::string gds_tmp_suffix =
    "_" + std::to_string(std::random_device{}()) + ".tmp";
```
Thread-local temp suffix may collide across threads.
std::random_device{}() returns a 32-bit value. With many threads, the birthday paradox makes collisions increasingly likely. If two threads draw the same suffix and write to the same file_path, they will use the same temp file, causing data corruption or race conditions.
Include thread ID for uniqueness:

```diff
-thread_local std::string gds_tmp_suffix =
-    "_" + std::to_string(std::random_device{}()) + ".tmp";
+thread_local std::string gds_tmp_suffix =
+    "_" + std::to_string(std::random_device{}()) + "_" +
+    std::to_string(std::hash<std::thread::id>{}(std::this_thread::get_id())) +
+    ".tmp";
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`
around lines 32 - 34, The thread-local temporary suffix gds_tmp_suffix is
vulnerable to collisions because it uses only std::random_device{}() (32-bit);
change its generation to include the calling thread's identity and stronger
uniqueness (for example combine std::this_thread::get_id() and a per-thread or
global atomic counter or a larger random value) so each thread produces a
distinct suffix; update the thread_local gds_tmp_suffix initialization (the
symbol gds_tmp_suffix in gds_file_io.cpp) to concatenate thread id and an added
unique token to avoid temp file collisions when writing the same file_path.
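The birthday-paradox concern above can be quantified with the standard approximation `p ≈ 1 - exp(-n(n-1) / (2 * 2^32))` for `n` draws from a 32-bit space; this is a back-of-the-envelope sketch, not code from the PR:

```python
import math

def collision_probability(n_threads: int, bits: int = 32) -> float:
    """Birthday-paradox approximation for n random values drawn from 2**bits."""
    space = 2.0 ** bits
    return 1.0 - math.exp(-n_threads * (n_threads - 1) / (2.0 * space))

# Collisions are negligible for a handful of threads but grow quadratically
# with n; appending the thread id removes this failure mode entirely.
p_64 = collision_probability(64)        # a few worker threads
p_100k = collision_probability(100_000) # long-lived process churning threads
```

At 100,000 draws the collision probability already exceeds one half, which is why suffixing with the thread id (or a monotonic counter) is the safer construction.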
```cpp
for (size_t i = 0; i < num_chunks; i++) {
    void* block_ptr = static_cast<uint8_t*>(gpu_ptr) + (i * chunk_size);

    CUfileError_t status = m_cufile.cuFileBufRegister(block_ptr,
                                                      chunk_size,
                                                      CU_FILE_RDMA_REGISTER);
    if (status.err != CU_FILE_SUCCESS) {
        FS_LOG_WARN("GdsFileIO: cuFileBufRegister failed for block "
                    << i << " with error code: " << status.err);
        return false;
    }

    m_registered_buffers[block_ptr] = chunk_size;
}
```
Buffer overread in chunked registration when size is not a multiple of chunk_size.
When size is not evenly divisible by chunk_size, the final iteration registers chunk_size bytes starting at block_ptr, but only size - (i * chunk_size) bytes remain. This causes cuFileBufRegister to access memory beyond the buffer boundary (undefined behavior).
Fix: clamp the last chunk size

```diff
 for (size_t i = 0; i < num_chunks; i++) {
   void* block_ptr = static_cast<uint8_t*>(gpu_ptr) + (i * chunk_size);
+  size_t this_chunk_size = std::min(chunk_size, size - (i * chunk_size));
   CUfileError_t status = m_cufile.cuFileBufRegister(block_ptr,
-                                                    chunk_size,
+                                                    this_chunk_size,
                                                     CU_FILE_RDMA_REGISTER);
   if (status.err != CU_FILE_SUCCESS) {
     FS_LOG_WARN("GdsFileIO: cuFileBufRegister failed for block "
                 << i << " with error code: " << status.err);
     return false;
   }
-  m_registered_buffers[block_ptr] = chunk_size;
+  m_registered_buffers[block_ptr] = this_chunk_size;
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
for (size_t i = 0; i < num_chunks; i++) {
    void* block_ptr = static_cast<uint8_t*>(gpu_ptr) + (i * chunk_size);
    size_t this_chunk_size = std::min(chunk_size, size - (i * chunk_size));
    CUfileError_t status = m_cufile.cuFileBufRegister(block_ptr,
                                                      this_chunk_size,
                                                      CU_FILE_RDMA_REGISTER);
    if (status.err != CU_FILE_SUCCESS) {
        FS_LOG_WARN("GdsFileIO: cuFileBufRegister failed for block "
                    << i << " with error code: " << status.err);
        return false;
    }
    m_registered_buffers[block_ptr] = this_chunk_size;
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`
around lines 215 - 228, The loop that registers GPU buffer chunks (using
num_chunks, chunk_size, size, block_ptr and calling m_cufile.cuFileBufRegister)
can overread when size isn't a multiple of chunk_size; change the registration
to compute remaining = size - (i * chunk_size) and set register_len =
std::min(chunk_size, remaining), call cuFileBufRegister with register_len (not
chunk_size) and store register_len in m_registered_buffers[block_ptr] so the
last block is clamped to the actual remaining bytes.
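The clamping arithmetic in the suggestion above can be checked in isolation. This sketch just mirrors the `min(chunk_size, size - i * chunk_size)` logic with plain integers:

```python
def chunk_lengths(size: int, chunk_size: int) -> list:
    """Per-chunk registration lengths, clamping the final chunk to the
    remaining bytes (the min(chunk_size, size - i*chunk_size) fix)."""
    num_chunks = (size + chunk_size - 1) // chunk_size  # ceiling division
    return [min(chunk_size, size - i * chunk_size) for i in range(num_chunks)]
```

For a 10-byte buffer with 4-byte chunks this yields lengths `[4, 4, 2]`: the lengths always sum to `size`, so no registration ever extends past the buffer end.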
```cpp
// O_RDWR required by cuFile for internal DMA setup even on write-only paths
int fd = open(tmp_path.c_str(), O_RDWR | O_CREAT | O_DIRECT, 0644);
if (fd < 0) {
    FS_LOG_ERROR("GdsFileIO: Failed to open temporary file "
                 << tmp_path << ": " << std::strerror(errno)
                 << " (errno=" << errno << ")");
    return false;
}
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Check if cuFile documentation specifies alignment requirements for O_DIRECT
rg -n -i "align|O_DIRECT" --type=cpp
```

Repository: opendatahub-io/llm-d-kv-cache

🏁 Script executed:

```shell
#!/bin/bash
# Read the function context around lines 259-266
sed -n '240,280p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp
```

🏁 Script executed:

```shell
#!/bin/bash
# Search for cuFileWrite and cuFileRead calls to see how they use file descriptors
rg -n -A 5 "cuFileWrite|cuFileRead" --type=cpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/
```

🏁 Script executed:

```shell
#!/bin/bash
# Check for any alignment-related code or buffers
rg -n -i "alignment|aligned|align" --type=cpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp
```

🏁 Script executed:

```shell
#!/bin/bash
# Find where actual_gpu_ptr is defined/set
rg -n "actual_gpu_ptr" --type=cpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/
```

🏁 Script executed:

```shell
#!/bin/bash
# Check the tensor_copier to see if it guarantees any alignment
rg -n -i "tensor_copier|get_tensors" --type=cpp -A 3 kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/ | head -60
```

🏁 Script executed:

```shell
#!/bin/bash
# Search for any cuFile documentation or alignment mentions in header files
rg -n -i "align|buffer.*requirement|DMA" --type=hpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/
```

🏁 Script executed:

```shell
#!/bin/bash
# View the actual_gpu_ptr assignment with context
sed -n '285,310p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp
```

🏁 Script executed:

```shell
#!/bin/bash
# View the read-side assignment
sed -n '377,405p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp
```

🏁 Script executed:

```shell
#!/bin/bash
# Find and examine tensor_copier.hpp
find . -name "tensor_copier.hpp" -o -name "tensor_copier.h" | head -5
```

🏁 Script executed:

```shell
#!/bin/bash
# Search for cuFileBufRegister to see if GPU buffers are registered with alignment info
rg -n "cuFileBufRegister" --type=cpp -B 2 -A 5
```

🏁 Script executed:

```shell
#!/bin/bash
# View tensor_copier.hpp
cat kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.hpp | head -100
```

🏁 Script executed:

```shell
#!/bin/bash
# Search for error code handling or EINVAL
rg -n "EINVAL|error.*code|errno.*22" --type=cpp kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/
```

🏁 Script executed:

```shell
#!/bin/bash
# Look for block_size documentation or constraints
rg -n -B 2 -A 2 "block_size" kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.hpp | head -40
```
Verify block_size alignment or implement fallback to non-O_DIRECT I/O.
O_DIRECT requires buffers aligned to filesystem block boundaries (512 bytes or 4KB). Line 293 calculates actual_gpu_ptr = tensor.data_ptr() + (gpu_block_idx * block_size). If block_size is not divisible by 512 or 4K, this arithmetic produces misaligned pointers, causing cuFileWrite/cuFileRead to fail with EINVAL on certain filesystems. Additionally, error handling (lines 304-307, 396-399) logs generic error codes without distinguishing alignment failures, preventing graceful degradation. Either assert that block_size meets alignment constraints, or implement fallback to standard I/O when O_DIRECT fails.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`
around lines 259 - 266, The current use of O_DIRECT when opening the temp file
(open(..., O_DIRECT)) can cause EINVAL in cuFileRead/cuFileWrite if
tensor.data_ptr() + (gpu_block_idx * block_size) or block_size is not aligned to
filesystem block boundaries; update the Open+IO path in GdsFileIO so it first
verifies alignment (ensure block_size and computed actual_gpu_ptr are aligned to
512/4096 as required) and if not aligned or if open with O_DIRECT or the cuFile
calls return EINVAL, retry the operation by reopening the file without O_DIRECT
and performing standard buffered I/O as a fallback; add explicit checks around
the arithmetic that computes actual_gpu_ptr (referencing tensor.data_ptr() and
gpu_block_idx) and improve error logging in the cuFileWrite/cuFileRead error
handlers to distinguish alignment errors vs other errno values so the code can
decide to retry without O_DIRECT.
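The alignment constraint behind this finding can be illustrated numerically. The 4 KiB alignment, the base address, and the block size below are assumptions chosen for demonstration, not values taken from the PR:

```python
def is_odirect_aligned(ptr: int, length: int, alignment: int = 4096) -> bool:
    """True if a buffer address and transfer length both satisfy a typical
    O_DIRECT alignment requirement (512 B or 4 KiB, filesystem-dependent)."""
    return ptr % alignment == 0 and length % alignment == 0

# A block_size that is not a multiple of the alignment makes every
# odd-indexed block pointer misaligned, even when the tensor base is aligned:
base = 0x7F0000000000       # assume the tensor base itself is 4 KiB aligned
block_size = 4096 + 256     # hypothetical misaligned block size
ptrs = [base + i * block_size for i in range(4)]
```

Here `ptrs[0]` passes the check but `ptrs[1]` lands 256 bytes off a 4 KiB boundary, which is exactly the pointer arithmetic failure mode the review describes for `actual_gpu_ptr`.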
| // StorageHandler interface: Read blocks from file | ||
| bool GdsFileIO::read_blocks_from_file(const std::string& file_path, | ||
| const std::vector<int64_t>& block_ids, | ||
| cudaStream_t stream) { | ||
| // Each ThreadPool thread has its own CUDA stream, but cuFileRead is | ||
| // synchronous and operates directly on the device — no stream needed. | ||
| (void)stream; |
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Check how async_load_gpu_blocks callers synchronize
ast-grep --pattern 'async_load_gpu_blocks($$$)'
rg -n -B10 -A5 'async_load_gpu_blocks'
```

🏁 Script executed:

```shell
# Get the full implementation of async_load_gpu_blocks to understand synchronization
sed -n '286,350p' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp
```

🏁 Script executed:

```shell
# Also get the write path implementation for comparison
sed -n '210,280p' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp
```

🏁 Script executed:

```shell
# Check for synchronization calls related to gpu_kvs_ready_event
rg -n 'gpu_kvs_ready_event' kv_connectors/llmd_fs_backend/csrc/storage/
```

🏁 Script executed:

```shell
# Check how read_blocks_from_file is called within async_load_gpu_blocks
rg -n 'read_blocks_from_file' kv_connectors/llmd_fs_backend/csrc/storage/
```

🏁 Script executed:

```shell
# Check the wait_job and get_finished implementations
sed -n '130,180p' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp
```

🏁 Script executed:

```shell
# Check how Python caller waits after async_load_gpu_blocks
sed -n '230,260p' kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py
```

🏁 Script executed:

```shell
# Check get_finished implementation
rg -n 'get_finished' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp -A10
```

🏁 Script executed:

```shell
# Check if ThreadPool's TLS stream is related to CUDA default stream
rg -n 'get_tls_stream\|ThreadPool' kv_connectors/llmd_fs_backend/csrc/storage/ -A5 | head -80
```

🏁 Script executed:

```shell
# Get more context on wait_job implementation
sed -n '170,200p' kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp
```

🏁 Script executed:

```shell
# Check GdsFileIO read_blocks_from_file full implementation
sed -n '340,400p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp
```

🏁 Script executed:

```shell
# Check if there's stream synchronization after async jobs
rg -n 'cudaStreamSynchronize\|cudaStreamWaitEvent' kv_connectors/llmd_fs_backend/csrc/storage/
```

🏁 Script executed:

```shell
# Check how the loaded blocks are accessed in Python after recording job
rg -n '_record_job\|wait_job\|get_finished' kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py -B3 -A10
```

🏁 Script executed:

```shell
# Check the end of read_blocks_from_file to see full implementation
sed -n '390,430p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp
```

🏁 Script executed:

```shell
# Check FileIO::read_blocks_from_file for comparison (non-GDS variant)
sed -n '190,230p' kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_io/file_io.cpp
```

🏁 Script executed:

```shell
# Check what happens after _record_job is called - trace the control flow
grep -n '_record_job\|wait_job\|get_finished' kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py
```

🏁 Script executed:

```shell
# Check if there's any explicit synchronization at the scheduler level
rg -n 'wait_job\|get_finished' kv_connectors/llmd_fs_backend/llmd_fs_backend/ -B5 -A5
```

🏁 Script executed:

```shell
# Check scheduler/high-level usage to see when buffers become accessible
rg -n 'get_finished\|wait_job' kv_connectors/llmd_fs_backend/llmd_fs_backend/ -B10 -A10
```

🏁 Script executed:

```shell
# Check if there's any CUDA synchronization between wait_job and actual GPU access
rg -n 'torch.cuda\|cuda_stream\|synchronize' kv_connectors/llmd_fs_backend/llmd_fs_backend/
```

🏁 Script executed:

```shell
# Compare the two read implementations side by side - check for stream sync
rg -n 'cudaStreamSynchronize' kv_connectors/llmd_fs_backend/csrc/storage/backends/
```
GdsFileIO read path ignores CUDA stream and lacks GPU synchronization, creating potential data race with concurrent GPU access.
GdsFileIO::read_blocks_from_file (line 340-346) discards the stream parameter with (void)stream; and performs cuFileRead directly without stream synchronization. This contrasts with the write path (async_store_gpu_blocks), which waits on gpu_kvs_ready_event before write operations start, and with FileIO::read_blocks_from_file, which calls cudaStreamSynchronize(stream) after read completes (line 217). Since cuFileRead is synchronous but not stream-aware, concurrent GPU kernels reading from the target buffers during or shortly after the read operation will access stale or partially-written data, risking corruption.
Add pre-read synchronization via cudaStreamWaitEvent (like the write path) or post-read cudaStreamSynchronize (like FileIO) to ensure GPU work on the buffers completes before the read begins.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@kv_connectors/llmd_fs_backend/csrc/storage/backends/fs_gds/gds_file_io.cpp`
around lines 340 - 346, GdsFileIO::read_blocks_from_file currently ignores the
provided cudaStream_t and performs cuFileRead without GPU synchronization;
update this function to synchronize GPU work on the target device buffers
before/after the cuFileRead (mirror the write path and FileIO behavior) by
calling cudaStreamWaitEvent(stream, gpu_kvs_ready_event) before issuing the read
or calling cudaStreamSynchronize(stream) after the read completes (or both if
needed) to prevent concurrent kernel access to partially-written data; reference
GdsFileIO::read_blocks_from_file, gpu_kvs_ready_event, cudaStreamWaitEvent,
cudaStreamSynchronize, async_store_gpu_blocks, and FileIO::read_blocks_from_file
when making the change.
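The ordering requirement here is the same one an event gives you on the CPU. A CPU-side sketch using `threading.Event` as a loose analogue of `cudaStreamWaitEvent` / `cudaStreamSynchronize` (no CUDA involved; the buffer stands in for GPU memory filled by `cuFileRead`):

```python
import threading

buffer = []
ready = threading.Event()

def producer():
    buffer.extend(range(4))  # stands in for cuFileRead filling GPU memory
    ready.set()              # stands in for recording/completing the event

def consumer(out):
    ready.wait()             # without this wait, out may observe a partial buffer
    out.extend(buffer)

out = []
t_consumer = threading.Thread(target=consumer, args=(out,))
t_producer = threading.Thread(target=producer)
t_consumer.start()
t_producer.start()
t_producer.join()
t_consumer.join()
```

Removing the `ready.wait()` makes the consumer racy in exactly the way the review warns about: kernels on the stream may run against stale or partially-written blocks.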
```cpp
// Execute the read operation using polymorphic storage handler
try {
    // Stage 1: Read file to staging CPU tensor.
    // Read data from disk into a tensor.
    success = TIME_EXPR("read phase 1: read_buffer_from_file",
                        read_buffer_from_file(src_file, buf),
                        "file:",
                        src_file);
    size_t total_size =
        block_ids.size() * m_tensor_copier.get_block_size();
    success = TIME_EXPR_THROUGHPUT(
        "read: storage handler",
        m_read_handler->read_blocks_from_file(src_file,
                                              block_ids,
                                              tls_stream.stream()),
        total_size,
        "file:",
        src_file,
        " blocks:",
        block_ids.size());
    if (!success) {
        FS_LOG_ERROR("Stage1 read_buffer_from_file failed for "
                     << src_file);
        return success;
    }
    // Stage 2: copy tensors from staging CPU tensor to GPU.
    // Perform asynchronous GPU copy and tensor swap.
    auto* cpu_base = static_cast<uint8_t*>(buf.ptr);
    bool is_store = false;
    // Execute the copy operation
    success = TIME_EXPR(
        "read phase 2: copy_cpu_tensor_to_gpu_tensors",
        m_tensor_copier.copy_blocks(cpu_base, block_ids, is_store),
        "file: ",
        src_file);

    auto& tls_stream = ThreadPool::get_tls_stream();
    cudaError_t err = cudaStreamSynchronize(tls_stream.stream());
    if (err != cudaSuccess) {
        FS_LOG_ERROR(
            "cudaStreamSynchronize failed: " << cudaGetErrorString(err));
        return false;
        FS_LOG_ERROR("Load failed for " << src_file);
    }
    return success;
```
Missing atime update after GDS read operations breaks LRU eviction.
The write path calls FileIO::update_atime() when a file already exists (line 231), but the read path never updates atime. When m_read_handler is a GdsFileIO instance, files read via GDS won't have their access times updated, breaking LRU cache eviction logic that depends on atime. As the relevant snippet from storage_offload.cpp:313-321 shows, no update_atime() call follows the read operation.
Add atime update after successful read:

```diff
     success = TIME_EXPR_THROUGHPUT(
         "read: storage handler",
         m_read_handler->read_blocks_from_file(src_file,
                                               block_ids,
                                               tls_stream.stream()),
         total_size,
         "file:",
         src_file,
         " blocks:",
         block_ids.size());
     if (!success) {
         FS_LOG_ERROR("Load failed for " << src_file);
+    } else {
+        FileIO::update_atime(src_file);
     }
     return success;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp` around lines
310 - 327, The read path in storage_offload.cpp does not update access time
after a successful GDS read, which breaks LRU eviction; after a successful
TIME_EXPR_THROUGHPUT call that sets success from
m_read_handler->read_blocks_from_file (the block read branch in the function
where src_file and block_ids are used), call the FileIO::update_atime for the
same src_file (or invoke the appropriate update_atime method on the file IO
implementation such as GdsFileIO) before returning; ensure the update is
performed only when success is true and handle/log any update failures similarly
to the write path's update_atime usage so atime is kept in sync for LRU
eviction.
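The atime refresh itself is a small operation. `touch_atime` below is a hypothetical Python analogue of `FileIO::update_atime` (not the C++ implementation), showing the key property LRU eviction needs: only the access time moves, the modification time is preserved:

```python
import os
import tempfile
import time

def touch_atime(path: str) -> None:
    """Refresh only the access time, preserving mtime, so atime-based LRU
    eviction sees the file as recently used (sketch of an update_atime())."""
    st = os.stat(path)
    os.utime(path, (time.time(), st.st_mtime))  # (atime, mtime)

# Demonstrate on a scratch file:
fd, path = tempfile.mkstemp()
os.close(fd)
before = os.stat(path)
time.sleep(0.05)
touch_atime(path)
after = os.stat(path)
os.unlink(path)
```

Note that on `noatime`/`relatime` mounts, reads alone may not bump atime, which is another reason an explicit update after a successful GDS read is the robust choice.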
```jsonc
{
    // Optimized cuFile configuration for RDMA-based storage (NVMe-oF, BeeGFS, Lustre, GPFS, Weka).
    //
    // REQUIRED: Set rdma_dev_addr_list to the IP addresses of your RDMA NICs.
    // To find them: `ibdev2netdev` or `ip addr show` (look for your RDMA interface IPs).
    //
    // Usage:
    //   export CUFILE_ENV_PATH_JSON=/path/to/cufile_rdma.json
    //
    // NOTE: Application can override custom configuration via export CUFILE_ENV_PATH_JSON=<filepath>
    // e.g : export CUFILE_ENV_PATH_JSON="/home/<xxx>/cufile_rdma.json"
```
🧩 Analysis chain
🌐 Web query:
Does NVIDIA cuFile configuration support JSON with comments?
💡 Result:
No, NVIDIA cuFile configuration does not support JSON with comments. Official NVIDIA documentation for cuFile shows the /etc/cufile.json configuration file using standard JSON syntax with inline comments in example code (e.g., result 1: "// specify list of file-systems"), but these appear as documentation artifacts rather than parsed comments. Standard JSON parsers reject comments, and no cuFile-specific documentation or source evidence indicates use of a JSONC (JSON with comments) parser. RAPIDS cuDF JSON reader explicitly lacks comment support (result 16), reinforcing standard JSON expectations in NVIDIA ecosystem tools. cuFile is a proprietary closed-source library; no public source confirms custom parsing.
Citations:
- 1: https://docs.nvidia.com/cuda/archive/11.6.1/pdf/cuFile_API.pdf
- 2: https://docs.nvidia.com/gpudirect-storage/api-reference-guide/index.html
- 3: https://docs.nvidia.com/gpudirect-storage/configuration-guide/index.html.md
- 4: https://docs.nvidia.com/cuda/archive/11.4.3/cufile-api/index.html
- 5: https://docs.nvidia.com/cuda/archive/11.8.0/cufile-api/index.html
- 6: [FEA] JSON reader: ignores Java/C++ style comment rapidsai/cudf#10265
- 7: https://docs.nvidia.com/cuda/archive/11.4.0/gds/configuration-guide/topics/gds-parameters.html
Remove // comments—cuFile requires standard JSON and will fail to parse this configuration.
NVIDIA cuFile does not support JSON with comments. This template will generate parse errors when users attempt to use it. Either:

- Remove all `//` comments and keep the file as `.json`, or
- Provide `cufile_rdma.json` (valid JSON, no comments) alongside this file as `.md` documentation.

Users cannot directly use this file as `CUFILE_ENV_PATH_JSON` without editing.
🧰 Tools
🪛 Biome (2.4.9)
[error] 2-2: Expected a property but instead found '// Optimized cuFile configuration for RDMA-based storage (NVMe-oF, BeeGFS, Lustre, GPFS, Weka).'.
(parse)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@kv_connectors/llmd_fs_backend/docs/cufile_rdma.json` around lines 1 - 12, The
cufile_rdma.json template contains // comments which make it invalid JSON for
cuFile; update the file cufile_rdma.json so it is strict JSON with all comments
removed (leave only JSON key/value entries), or instead convert this file into a
.md (documentation) and add a separate valid cufile_rdma.json without comments;
ensure any references to CUFILE_ENV_PATH_JSON in the content remain unchanged
and the final JSON parses cleanly with no // comment lines.
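The parse failure is easy to reproduce with Python's strict JSON parser; the config key below is taken from the template, but the value is illustrative only:

```python
import json

doc = """{
  // comments like this are rejected by strict JSON parsers
  "rdma_dev_addr_list": ["192.168.0.1"]
}"""

def strip_line_comments(text: str) -> str:
    """Remove whole-line // comments (naive: assumes no '//' inside strings)."""
    return "\n".join(
        line for line in text.splitlines()
        if not line.lstrip().startswith("//")
    )

try:
    json.loads(doc)
    parsed_with_comments = True
except json.JSONDecodeError:
    parsed_with_comments = False

cleaned = json.loads(strip_line_comments(doc))
```

The commented document fails to parse while the stripped version round-trips, which is why the shipped `.json` must be comment-free.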
| GDS uses the **cuFile** library by default for all GPU↔storage transfers. cuFile behavior (I/O threads, |
| bounce buffer sizes, RDMA settings, filesystem-specific tuning) can be customized via a `cufile_rdma.json` |
| config file — see [Tuning cuFile](#tuning-cufile-cufilejson) for an example. |
Broken link fragment: heading anchor mismatch.
The link `#tuning-cufile-cufilejson` won't resolve to the heading at line 93 (`## Tuning cuFile (cufile_rdma.json)`). GitHub generates the anchor as `#tuning-cufile-cufile_rdmajson` (the underscore from the filename is preserved).
Proposed fix
-bounce buffer sizes, RDMA settings, filesystem-specific tuning) can be customized via a `cufile_rdma.json`
-config file — see [Tuning cuFile](#tuning-cufile-cufilejson) for an example.
+bounce buffer sizes, RDMA settings, filesystem-specific tuning) can be customized via a `cufile_rdma.json`
+config file — see [Tuning cuFile](#tuning-cufile-cufile_rdmajson) for an example.

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| config file — see [Tuning cuFile](#tuning-cufile-cufilejson) for an example. |
| config file — see [Tuning cuFile](#tuning-cufile-cufile_rdmajson) for an example. |
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 8-8: Link fragments should be valid
(MD051, link-fragments)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@kv_connectors/llmd_fs_backend/docs/gds.md` at line 8, The link fragment in
gds.md uses the wrong anchor `#tuning-cufile-cufilejson`; update it to match the
generated GitHub anchor for the heading `## Tuning cuFile (`cufile_rdma.json`)`
by changing the fragment to `#tuning-cufile-cufile_rdmajson` (or alternatively
rename the heading to remove the underscore to match the existing link). Locate
the incorrect fragment string in the link and replace it with the corrected
anchor `#tuning-cufile-cufile_rdmajson` (or adjust the heading text `Tuning
cuFile (cufile_rdma.json)`) so the link resolves.
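The mismatch follows from how GitHub slugifies headings: lowercase, punctuation dropped, spaces replaced with hyphens, and underscores kept. A rough approximation of that rule, for illustration only (GitHub's actual slugger handles more edge cases):

```python
import re

def github_anchor(heading: str) -> str:
    """Approximate GitHub's heading-to-anchor slug: lowercase the text,
    drop punctuation (word characters, hyphens, and spaces survive;
    underscores count as word characters), then turn spaces into hyphens."""
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\- ]", "", slug)  # strips parens, dots, backticks
    return slug.replace(" ", "-")

print(github_anchor("Tuning cuFile (cufile_rdma.json)"))
# tuning-cufile-cufile_rdmajson
```

Running the heading through this rule shows why `#tuning-cufile-cufilejson` never matches: the dot in the filename is dropped but the underscore is not.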
| gpu_block_size: int, |
| gpu_blocks_per_file: int, |
| threads_per_gpu: int, |
| gds_mode: str, |
Missing default value for the `gds_mode` parameter.
The constructor lacks a default value for `gds_mode`, making it a required argument. This is a breaking change for any existing callers. Consider adding `gds_mode: str = "disabled"` for backward compatibility, consistent with the C++ binding default.
Proposed fix
def __init__(
self,
kv_caches: dict[str, torch.Tensor],
attn_backends: dict[str, type[AttentionBackend]],
file_mapper: FileMapper,
gpu_block_size: int,
gpu_blocks_per_file: int,
threads_per_gpu: int,
- gds_mode: str,
+ gds_mode: str = "disabled",
max_staging_memory_gb: int = DEFAULT_MAX_STAGING_MEMORY_GB,
read_preferring_ratio: float = DEFAULT_READ_PREFERRING_WORKERS_RATIO,
):

📝 Committable suggestion
| gds_mode: str, |
| gds_mode: str = "disabled", |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py` at line 253, The
constructor for the worker class (the __init__ method in
llmd_fs_backend/worker.py) makes gds_mode a required parameter which breaks
callers; change the signature to give gds_mode a default of "disabled" (e.g.,
gds_mode: str = "disabled") so callers are backward compatible and consistent
with the C++ binding default, and update any internal references or docstrings
that assume a provided value if necessary.
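The backward-compatibility argument can be sketched with a hypothetical constructor (names are illustrative, not the actual worker.py code): giving `gds_mode` a keyword default means call sites written before the parameter existed keep working unchanged.

```python
# Hypothetical sketch, not the actual llmd_fs_backend worker code.
def init_worker(gpu_block_size: int, gpu_blocks_per_file: int,
                threads_per_gpu: int, gds_mode: str = "disabled") -> dict:
    """With a default, gds_mode becomes optional: callers written
    against the old signature still work, and the default matches
    the C++ binding's "disabled" behavior."""
    return {
        "gpu_block_size": gpu_block_size,
        "gpu_blocks_per_file": gpu_blocks_per_file,
        "threads_per_gpu": threads_per_gpu,
        "gds_mode": gds_mode,
    }

# A pre-existing call site that never passed gds_mode keeps working:
legacy = init_worker(16, 8, 4)
print(legacy["gds_mode"])  # disabled

# New callers opt in explicitly:
gds = init_worker(16, 8, 4, gds_mode="enabled")
print(gds["gds_mode"])  # enabled
```

Making the new parameter keyword-defaulted rather than required is the standard way to extend a constructor without a breaking change.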
Syncs llm-d/llm-d-kv-cache main branch into ODH main_2 branch.
Upstream commit: llm-d@1c435cb
This maps to https://github.com/llm-d/llm-d-kv-cache/releases/tag/v0.7.0
Summary by CodeRabbit
Release Notes
New Features
Improvements
Documentation
Tests