
Conversation

@maifeeulasad

What does this PR do?

Fixes #299

This PR adds support for GDS (GPU Direct Storage) for PyTorch for now. If I get positive feedback, I am definitely willing to integrate it into other modules as well.

Open for making changes based on review.

 - try to load with GDS by default when not mentioned explicitly
 - a flag to control loading with GDS
 - load in dynamic chunks for GDS
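The flag-plus-fallback behaviour described in these bullets could be sketched roughly as follows (a hypothetical helper for illustration only; `choose_loader`, its arguments, and the backend names are made up and are not the PR's actual code):

```python
def choose_loader(use_gds=None, gds_available=False):
    """Pick a loading backend (illustrative sketch).

    An explicit use_gds flag always wins; otherwise fall back to GDS
    only when the environment reports it as available.
    """
    if use_gds is not None:
        return "gds" if use_gds else "mmap"
    return "gds" if gds_available else "mmap"
```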
@maifeeulasad
Author

also could be related to: #528

Adding this would be great support for mid-range GPUs. Just slap in an SSD, and you are good to go. Definitely not optimal, but it works.


# GDS (GPU Direct Storage) support
try:
    import cupy as cp
Contributor


That's yet another library, we can't have another dependency here. Even if it was a core dependency of torch we wouldn't access it directly.

Author


Understood the architecture; I moved the GDS-related code into Rust, using FFI calls via libc.

Contributor

@Narsil Narsil left a comment


The whole code looks honestly pretty bad (looks like AI slop to me tbh, sorry if you handcrafted this).

  • We need benchmarks to showcase how it differs. Given DirectGPU support, I think being VERY precise in the environment the thing was tested on is very important. GDS in previous tests I have made was not really bringing much to the table. Hardware/cloud vendors are usually making it hard to use those features efficiently.
  • Remove the public surface. GDS, just like memory mapping, should be an implementation detail, not something users should care about. If it's not OBVIOUSLY faster, then we shouldn't use it.
  • The whole thing still goes through the CPU, which negates the entire benefit of using GDS.

@loqs

loqs commented Dec 3, 2025

@johnnynunez would anyone from Nvidia be willing to help on this? It would be particularly useful to the Spark with its current mmap issue, but I think it has performance benefits beyond that use case, and safetensors is more widely adopted than fastsafetensors, which already supports GDS but does not work with distributed vLLM.

@maifeeulasad
Author

gpu direct storage nearly closes the gap in safetensors

I will update this branch with my updated code, tests, benchmarks, and everything else I have found so far. Thanks!

@frauttauteffasu

Is there a benchmark provided which shows GDS as always slower on the code as of bc580db? It might make sense to also provide the GPU used, the versions of the NVIDIA components involved, the benchmark code, the command used, and where to obtain the .safetensors file used.

Have you benchmarked with and without cupy use?

Copilot AI review requested due to automatic review settings December 12, 2025 09:17
@maifeeulasad maifeeulasad changed the title from "nvidia gds support for pytorch | can extend to other" to "nvidia gds support" on Dec 12, 2025

Copilot AI left a comment


Pull request overview

This PR adds GPU Direct Storage (GDS) support for PyTorch in safetensors, enabling zero-copy data transfer directly from NVMe storage to GPU memory, bypassing the CPU. This feature is currently Linux-only and requires NVIDIA's cuFile library (libcufile.so).

Key Changes:

  • Rust FFI bindings to NVIDIA's cuFile library for GDS operations
  • New storage backend (CudaGds) that performs direct NVMe-to-GPU transfers
  • Python API extension with use_gds parameter for safe_open and _safe_open_handle

Reviewed changes

Copilot reviewed 10 out of 13 changed files in this pull request and generated 19 comments.

Show a summary per file
File Description
bindings/python/tests/test_gds.py Comprehensive test suite for GDS functionality with various tensor sizes and data types
bindings/python/src/lib.rs Integration of GDS storage backend with validation and tensor loading logic
bindings/python/src/gds/storage.rs GDS storage implementation with read-to-device functionality
bindings/python/src/gds/mod.rs Module organization for GDS components
bindings/python/src/gds/handle.rs cuFile handle management with RAII semantics
bindings/python/src/gds/error.rs Error types for GDS operations
bindings/python/src/gds/driver.rs cuFile driver lifecycle management with singleton pattern
bindings/python/src/gds/bindings.rs Low-level FFI bindings to libcufile.so
bindings/python/py_src/safetensors/torch.py Whitespace-only formatting fix
bindings/python/py_src/safetensors/__init__.py Whitespace-only formatting fix
bindings/python/benches/test_gds.py Performance benchmarking suite comparing GDS vs standard loading
bindings/python/Cargo.toml Added libc dependency for FFI types
.gitignore Added cufile.log to ignored files
Comments suppressed due to low confidence (1)

bindings/python/tests/test_gds.py:5

  • Import of 'np' is not used.
import numpy as np


Comment on lines +63 to +65
if gpu_ptr.is_null() {
    return Err(GdsError::InvalidFileDescriptor);
}

Copilot AI Dec 12, 2025


Returning InvalidFileDescriptor error when gpu_ptr is null is misleading. A null GPU pointer is not related to a file descriptor issue. Consider using a more appropriate error variant like InvalidPointer or adding a new NullGpuPointer error variant to better describe the actual problem.
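A minimal sketch of what a dedicated variant might look like (hypothetical names; the PR's actual `GdsError` has more variants than shown here):

```rust
use std::ffi::c_void;

// Hypothetical error enum: a dedicated NullGpuPointer variant instead of
// reusing InvalidFileDescriptor for an unrelated failure mode.
#[derive(Debug, PartialEq)]
pub enum GdsError {
    InvalidFileDescriptor,
    NullGpuPointer,
}

fn check_gpu_ptr(gpu_ptr: *const c_void) -> Result<(), GdsError> {
    if gpu_ptr.is_null() {
        // The error now names the actual problem.
        return Err(GdsError::NullGpuPointer);
    }
    Ok(())
}

fn main() {
    assert_eq!(check_gpu_ptr(std::ptr::null()), Err(GdsError::NullGpuPointer));
    let x = 1u8;
    assert_eq!(check_gpu_ptr(&x as *const u8 as *const c_void), Ok(()));
    println!("ok");
}
```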

Comment on lines +57 to +62
pub unsafe fn read_to_device(
    &self,
    gpu_ptr: *mut std::ffi::c_void,
    size: usize,
    file_offset: usize,
) -> Result<usize, GdsError> {

Copilot AI Dec 12, 2025


The safety documentation mentions that "The file has at least file_offset + size bytes" as a caller requirement, but this is actually validated by the function at lines 68-70. If the function performs this validation, it shouldn't be listed as a caller requirement in the safety documentation. Either remove the validation (making it truly unsafe) or update the documentation to only list the actual unsafe requirements (valid GPU pointer and buffer size).

Comment on lines +88 to +95
/// Returns number of bytes read on success, negative on error
pub fn cuFileRead(
    handle: CUfileHandle_t,
    buf: *mut c_void,
    size: c_size_t,
    file_offset: c_longlong,
    dev_offset: c_longlong,
) -> isize;

Copilot AI Dec 12, 2025


The documentation comment should clarify the behavior when a negative value is returned. Based on the usage in storage.rs, negative return values indicate errors, but the comment only mentions "number of bytes read on success, negative on error" without specifying what the negative value represents (error code, -errno, etc.).
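One way to make that convention explicit at the call site is to fold the raw return value into a `Result`. This is a sketch; it assumes (as the comment suggests, and as should be verified against the cuFile documentation) that a negative return value is a negated error code:

```rust
// Sketch: map a cuFile-style return value (bytes read on success,
// negative code on failure) into a Result. The "negated error code"
// interpretation is an assumption to be checked against cuFile docs.
fn map_cufile_ret(ret: isize) -> Result<usize, i32> {
    if ret < 0 {
        Err(-ret as i32)
    } else {
        Ok(ret as usize)
    }
}

fn main() {
    assert_eq!(map_cufile_ret(4096), Ok(4096));
    assert_eq!(map_cufile_ret(-22), Err(22)); // e.g. EINVAL
    println!("ok");
}
```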

Comment on lines +896 to +906
let shape_vec = info.shape.to_vec();
let shape: PyObject = shape_vec.clone().into_pyobject(py)?.into();

// Create tensor with correct dtype and device
let dtype_clone = dtype.clone_ref(py);
let kwargs = [
    (intern!(py, "dtype"), dtype_clone),
    (intern!(py, "device"), device),
].into_py_dict(py)?;

let shape_for_call: PyObject = shape_vec.into_pyobject(py)?.into();

Copilot AI Dec 12, 2025


The shape is being converted to PyObject twice: once at line 897 and again at line 906 (shape_for_call). This creates unnecessary duplication and the first conversion stored in 'shape' is only used later at line 925. Consider removing the first conversion and reusing shape_for_call, or clarifying why both are needed.

Comment on lines +924 to +925
// Reshape if needed
let tensor = tensor.getattr(intern!(py, "reshape"))?.call1((shape,))?;

Copilot AI Dec 12, 2025


Calling reshape after the GDS read may be unnecessary and could cause issues. The tensor is already created with the correct shape at line 907 using torch.empty(shape). The reshape call could potentially create a view or copy, which defeats the purpose of zero-copy GDS loading. Verify if reshape is actually needed here, or if the tensor already has the correct shape from creation.

Suggested change
// Reshape if needed
let tensor = tensor.getattr(intern!(py, "reshape"))?.call1((shape,))?;
// Tensor already has correct shape; no reshape needed

for _ in range(iterations):
    start = time.perf_counter()
    with safe_open(filename, framework="pt", device="cuda:0", use_gds=False) as f:
        tensor = f.get_tensor(tensor_name)

Copilot AI Dec 12, 2025


Variable tensor is not used.

for _ in range(iterations):
    start = time.perf_counter()
    with safe_open(filename, framework="pt", device="cuda:0", use_gds=True) as f:
        tensor = f.get_tensor(tensor_name)

Copilot AI Dec 12, 2025


Variable tensor is not used.

# Warmup
for _ in range(warmup):
    with safe_open(filename, framework="pt", device="cuda:0", use_gds=False) as f:
        tensor = f.get_tensor(tensor_name)

Copilot AI Dec 12, 2025


This assignment to 'tensor' is unnecessary as it is redefined before this value is used.

Suggested change
tensor = f.get_tensor(tensor_name)
f.get_tensor(tensor_name)

# Warmup
for _ in range(warmup):
    with safe_open(filename, framework="pt", device="cuda:0", use_gds=True) as f:
        tensor = f.get_tensor(tensor_name)

Copilot AI Dec 12, 2025


This assignment to 'tensor' is unnecessary as it is redefined before this value is used.

Suggested change
tensor = f.get_tensor(tensor_name)
f.get_tensor(tensor_name)

import os
import tempfile
import time
from typing import Dict, List, Tuple

Copilot AI Dec 12, 2025


Import of 'List' is not used.

Suggested change
from typing import Dict, List, Tuple
from typing import Dict, Tuple

@maifeeulasad
Author

The benchmarking and testing notebook can be found here, just run all: https://gist.github.com/maifeeulasad/741df3cf99a9970ce7a4897d5583b223

Kaggle uses a loopback FS, tmpfs, overlay FS, or a container-mounted ephemeral disk. But GDS requires one of the following:

  • XFS and EXT4 filesystem in ordered mode on NVMe/NVMeOF/ScaleFlux CSD devices.
  • NFS over RDMA with MOFED 5.1 and above
  • RDMA capable distributed filesystems like DDN Exascaler, WekaFS, and VAST.
  • ScaleFlux Computational storage

So I wasn't able to test in Kaggle. I haven't tested on Colab yet, but I think the situation would be pretty much the same. And most data centers still use HDDs.
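For anyone trying to reproduce this, a rough, non-authoritative way to check whether libcufile is even present, before worrying about filesystem support, might look like this (the paths below are common CUDA install locations, not an exhaustive or official list):

```python
import ctypes.util
import os

def libcufile_present():
    """Heuristic only: checks whether libcufile can be found at all.

    Finding the library does NOT mean the filesystem supports GDS;
    that still depends on the storage stack (XFS/EXT4 on NVMe, etc.).
    """
    if ctypes.util.find_library("cufile"):
        return True
    return any(
        os.path.exists(p)
        for p in (
            "/usr/local/cuda/lib64/libcufile.so",
            "/usr/local/cuda/targets/x86_64-linux/lib/libcufile.so",
        )
    )
```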

Based on what I have implemented and tested so far:

  • GDS performs really well when the tensor size is big
  • It also performs really well when we block the RAM with another script
  • It also performs really well if we put a lot of load on the CPU

So I see usage for GDS in real-life scenarios, where loads are quite noisy, especially when we are dealing with lots of models and applications.

I will try to share more information soon. Open for discussion.

ref:

@frauttauteffasu

This adds cufile as a hard dependency, failing to build on any platform without it:

         Compiling safetensors v0.7.0-dev.0 (/tmp/safetensors/safetensors)
      warning: unused import: `driver::GdsDriver`
        --> src/gds/mod.rs:18:9
         |
      18 | pub use driver::GdsDriver;
         |         ^^^^^^^^^^^^^^^^^
         |
         = note: `#[warn(unused_imports)]` on by default

      warning: unused import: `error::GdsError`
        --> src/gds/mod.rs:20:9
         |
      20 | pub use error::GdsError;
         |         ^^^^^^^^^^^^^^^

      warning: unused import: `handle::GdsHandle`
        --> src/gds/mod.rs:22:9
         |
      22 | pub use handle::GdsHandle;
         |         ^^^^^^^^^^^^^^^^^

      warning: unused variable: `result`
        --> src/gds/handle.rs:78:21
         |
      78 |                 let result = cuFileHandleDeregister(self.handle);
         |                     ^^^^^^ help: if this is intentional, prefix it
      with an underscore: `_result`
         |
         = note: `#[warn(unused_variables)]` on by default

      warning: method `is_initialized` is never used
        --> src/gds/driver.rs:64:12
         |
      27 | impl GdsDriver {
         | -------------- method in this implementation
      ...
      64 |     pub fn is_initialized(&self) -> bool {
         |            ^^^^^^^^^^^^^^
         |
         = note: `#[warn(dead_code)]` on by default

      warning: method `fd` is never used
        --> src/gds/handle.rs:62:12
         |
      21 | impl GdsHandle {
         | -------------- method in this implementation
      ...
      62 |     pub fn fd(&self) -> i32 {
         |            ^^

      warning: field `path` is never read
        --> src/gds/storage.rs:11:5
         |
      9  | pub struct GdsStorage {
         |            ---------- field in this struct
      10 |     handle: GdsHandle,
      11 |     path: PathBuf,
         |     ^^^^

      warning: methods `path` and `size` are never used
        --> src/gds/storage.rs:36:12
         |
      15 | impl GdsStorage {
         | --------------- methods in this implementation
      ...
      36 |     pub fn path(&self) -> &PathBuf {
         |            ^^^^
      ...
      41 |     pub fn size(&self) -> usize {
         |            ^^^^

      warning: creating a shared reference to mutable static
        --> src/gds/driver.rs:44:13
         |
      44 | /             DRIVER_INSTANCE
      45 | |                 .clone()
         | |________________________^ shared reference to mutable static
         |
         = note: for more information, see
      <https://doc.rust-lang.org/nightly/edition-guide/rust-2024/static-mut-references.html>
         = note: shared references to mutable statics are dangerous; it's
      undefined behavior if the static is mutated or if a mutable reference is
      created for it while the shared reference lives
         = note: `#[warn(static_mut_refs)]` on by default

      error: linking with `cc` failed: exit status: 1
        |
        = note:  "cc" "-Wl,--version-script=/tmp/rustcO36qMS/list"
      "-Wl,--no-undefined-version" "-m64" "/tmp/rustcO36qMS/symbols.o"
      "<17 object files omitted>" "-Wl,--as-needed"
      "-Wl,-Bdynamic" "-lcufile" "-Wl,-Bstatic"
      "/tmp/safetensors/bindings/python/target/release/deps/{libsafetensors-6d7dcccfb74f4666.rlib,libserde_json-f8b6f585bf5df12b.rlib,libmemchr-792e9d77b6c57362.rlib,libitoa-eb11a3c454c526bf.rlib,libryu-e987ae88e9011779.rlib,libserde-737e8e7c0e87e24c.rlib,libserde_core-31036d4ca4b1a638.rlib,libpyo3-656dbc47a26c5839.rlib,libmemoffset-7d7bab1c5fa168b5.rlib,libonce_cell-1a293e670803d6dd.rlib,libpyo3_ffi-116dfab57aef71bd.rlib,libunindent-4f30b6354b8a919b.rlib,libmemmap2-846ee8a68c9859d8.rlib,liblibc-994a2b65212047a6.rlib}.rlib"
      "<sysroot>/lib/rustlib/x86_64-unknown-linux-gnu/lib/{libstd-*,libpanic_unwind-*,libobject-*,libmemchr-*,libaddr2line-*,libgimli-*,librustc_demangle-*,libstd_detect-*,libhashbrown-*,librustc_std_workspace_alloc-*,libminiz_oxide-*,libadler2-*,libunwind-*,libcfg_if-*,liblibc-*,librustc_std_workspace_core-*,liballoc-*,libcore-*,libcompiler_builtins-*}.rlib"
      "-Wl,-Bdynamic" "-lgcc_s" "-lutil" "-lrt" "-lpthread" "-lm" "-ldl"
      "-lc" "-L" "/tmp/rustcO36qMS/raw-dylibs" "-Wl,--eh-frame-hdr"
      "-Wl,-z,noexecstack" "-L" "<sysroot>/local/cuda/lib64" "-L"
      "<sysroot>/local/cuda-12/lib64" "-L" "<sysroot>/local/cuda-13.0/lib64"
      "-L" "<sysroot>/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-o"
      "/tmp/safetensors/bindings/python/target/release/deps/libsafetensors_rust.so"
      "-Wl,--gc-sections" "-shared" "-Wl,-z,relro,-z,now" "-Wl,-O1"
      "-Wl,--strip-debug" "-nodefaultlibs"
        = note: some arguments are omitted. use `--verbose` to show all linker
      arguments
        = note: /usr/bin/ld: cannot find -lcufile: No such file or directory
                collect2: error: ld returned 1 exit status
      

      warning: `safetensors-python` (lib) generated 9 warnings
      error: could not compile `safetensors-python` (lib) due to 1 previous
      error; 9 warnings emitted
      💥 maturin failed

@maifeeulasad
Author

Excellent edge case; I have now added a Cargo feature as a guard rail to prevent leaking GDS code into regular builds. Please check. So the GDS feature needs to be built explicitly, which is even better, as per my understanding.

The updated notebook can be found at the previous URL; here is the URL again for your convenience: https://gist.github.com/maifeeulasad/741df3cf99a9970ce7a4897d5583b223#file-safetensors-benchmarking-withand-without-gds-ipynb

Open for further review, discussion, exploration.
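The guard-rail idea above can be sketched like this (this is the general shape of Cargo feature gating, not the PR's exact code; the module and function names are illustrative). With the `cuda-gds` feature off, none of the GDS symbols are compiled, so nothing links against libcufile:

```rust
// With a plain `cargo build` (feature off), only the stub module is
// compiled and libcufile is never referenced;
// `cargo build --features cuda-gds` would swap in the real implementation.
#[cfg(feature = "cuda-gds")]
mod gds {
    pub fn available() -> bool { true }
}

#[cfg(not(feature = "cuda-gds"))]
mod gds {
    pub fn available() -> bool { false }
}

fn main() {
    println!("gds compiled in: {}", gds::available());
}
```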

@maifeeulasad maifeeulasad requested a review from Narsil December 12, 2025 17:25
@frauttauteffasu

Please explain why cupy is required. What limitations are there in the PyTorch API that make it impossible to implement this without adding a dependency on cupy?

Comment on lines +12 to +14
[features]
default = []
cuda-gds = []

@frauttauteffasu frauttauteffasu Dec 12, 2025


As this is not exposed in the Python binding, it forces the user to manually invoke maturin, which contradicts the README.

@phaserblast

I don't think I understand: is the GDS reader implemented in Python? What if I want to use safetensors from a Rust or C project? Shouldn't the reader be implemented in native Rust and only have bindings for Python?

@frauttauteffasu

The GDS reader is implemented in Rust with Python bindings. However, it currently requires calling maturin to build the Rust GDS code and then installing the wheel maturin built, as there is no Python option to trigger that. So the README instructions do not work. It is also unclear how this would work with a prebuilt package hosted on PyPI.

$ cd safetensors/bindings/python/
$ maturin build
$ python -m pip install target/wheels/*.whl


Successfully merging this pull request may close these issues.

Any plan to support Nvidia GPUDirect Storage?
