
Conversation

@maifeeulasad

What does this PR do?

Fixes #299

This PR adds support for GDS (GPU Direct Storage) for PyTorch for now. If I get positive feedback, I am definitely willing to integrate it into other modules as well.

Open for making changes based on review.

 - try to load with GDS by default when not mentioned explicitly
 - a flag to control loading with GDS
 - load in dynamic chunks for GDS
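The flag-plus-fallback behaviour described in these bullets could be sketched roughly as follows (a hypothetical helper for illustration only; `choose_loader`, its arguments, and the backend names are made up and are not the PR's actual code):

```python
def choose_loader(use_gds=None, gds_available=False):
    """Pick a loading backend (illustrative sketch).

    An explicit use_gds flag always wins; otherwise fall back to GDS
    only when the environment reports it as available.
    """
    if use_gds is not None:
        return "gds" if use_gds else "mmap"
    return "gds" if gds_available else "mmap"
```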
@maifeeulasad
Author

also could be related to: #528

Adding this would be great support for mid-range GPUs. Just slap in an SSD, and you are good to go. Definitely not optimal, but it works.


# GDS (GPU Direct Storage) support
try:
    import cupy as cp
Contributor


That's yet another library, we can't have another dependency here. Even if it was a core dependency of torch we wouldn't access it directly.

Author


Understood the architecture; I moved the GDS-related code into Rust, using FFI calls via libc.

Contributor

@Narsil Narsil left a comment


The whole code looks honestly pretty bad (looks like AI slop to me tbh, sorry if you handcrafted this).

  • We need benchmarks to showcase how it differs. Given DirectGPU support, I think being VERY precise in the environment the thing was tested on is very important. GDS in previous tests I have made was not really bringing much to the table. Hardware/cloud vendors are usually making it hard to use those features efficiently.
  • Remove the public surface. GDS, just like memory mapping, should be an implementation detail, not something users should care about. If it's not OBVIOUSLY faster, then we shouldn't use it.
  • The whole thing still goes through the CPU, which negates the entire benefit of using GDS.

@loqs

loqs commented Dec 3, 2025

@johnnynunez would anyone from Nvidia be willing to help on this? It would be particularly useful to the Spark with its current mmap issue, but I think it has performance benefits beyond that use case, and safetensors is more widely adopted than fastsafetensors, which already supports GDS but does not work with distributed vLLM.

@maifeeulasad
Author

gpu direct storage nearly closes the gap in safetensors

I will update this branch with my updated code, tests, benchmarks, and everything else I have found so far. Thanks!

@frauttauteffasu

Is there a benchmark provided which shows GDS as always slower on the code as of bc580db? It might make sense to also provide the GPU used, the versions of the NVIDIA components involved, the benchmark code, the command used, and where to obtain the .safetensors file used.

Have you benchmarked with and without cupy use?

Copilot AI review requested due to automatic review settings December 12, 2025 09:17
@maifeeulasad maifeeulasad changed the title from "nvidia gds support for pytorch | can extend to other" to "nvidia gds support" on Dec 12, 2025

Copilot AI left a comment


Pull request overview

This PR adds GPU Direct Storage (GDS) support for PyTorch in safetensors, enabling zero-copy data transfer directly from NVMe storage to GPU memory, bypassing the CPU. This feature is currently Linux-only and requires NVIDIA's cuFile library (libcufile.so).

Key Changes:

  • Rust FFI bindings to NVIDIA's cuFile library for GDS operations
  • New storage backend (CudaGds) that performs direct NVMe-to-GPU transfers
  • Python API extension with use_gds parameter for safe_open and _safe_open_handle

Reviewed changes

Copilot reviewed 10 out of 13 changed files in this pull request and generated 19 comments.

Show a summary per file
File Description
bindings/python/tests/test_gds.py Comprehensive test suite for GDS functionality with various tensor sizes and data types
bindings/python/src/lib.rs Integration of GDS storage backend with validation and tensor loading logic
bindings/python/src/gds/storage.rs GDS storage implementation with read-to-device functionality
bindings/python/src/gds/mod.rs Module organization for GDS components
bindings/python/src/gds/handle.rs cuFile handle management with RAII semantics
bindings/python/src/gds/error.rs Error types for GDS operations
bindings/python/src/gds/driver.rs cuFile driver lifecycle management with singleton pattern
bindings/python/src/gds/bindings.rs Low-level FFI bindings to libcufile.so
bindings/python/py_src/safetensors/torch.py Whitespace-only formatting fix
bindings/python/py_src/safetensors/__init__.py Whitespace-only formatting fix
bindings/python/benches/test_gds.py Performance benchmarking suite comparing GDS vs standard loading
bindings/python/Cargo.toml Added libc dependency for FFI types
.gitignore Added cufile.log to ignored files
Comments suppressed due to low confidence (1)

bindings/python/tests/test_gds.py:5

  • Import of 'np' is not used.
import numpy as np


Comment on lines +63 to +65
if gpu_ptr.is_null() {
    return Err(GdsError::InvalidFileDescriptor);
}

Copilot AI Dec 12, 2025


Returning InvalidFileDescriptor error when gpu_ptr is null is misleading. A null GPU pointer is not related to a file descriptor issue. Consider using a more appropriate error variant like InvalidPointer or adding a new NullGpuPointer error variant to better describe the actual problem.
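A minimal sketch of what a dedicated variant might look like (hypothetical names; the PR's actual `GdsError` has more variants than shown here):

```rust
use std::ffi::c_void;

// Hypothetical error enum: a dedicated NullGpuPointer variant instead of
// reusing InvalidFileDescriptor for an unrelated failure mode.
#[derive(Debug, PartialEq)]
pub enum GdsError {
    InvalidFileDescriptor,
    NullGpuPointer,
}

fn check_gpu_ptr(gpu_ptr: *const c_void) -> Result<(), GdsError> {
    if gpu_ptr.is_null() {
        // The error now names the actual problem.
        return Err(GdsError::NullGpuPointer);
    }
    Ok(())
}

fn main() {
    assert_eq!(check_gpu_ptr(std::ptr::null()), Err(GdsError::NullGpuPointer));
    let x = 1u8;
    assert_eq!(check_gpu_ptr(&x as *const u8 as *const c_void), Ok(()));
    println!("ok");
}
```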

Comment on lines +57 to +62
pub unsafe fn read_to_device(
    &self,
    gpu_ptr: *mut std::ffi::c_void,
    size: usize,
    file_offset: usize,
) -> Result<usize, GdsError> {

Copilot AI Dec 12, 2025


The safety documentation mentions that "The file has at least file_offset + size bytes" as a caller requirement, but this is actually validated by the function at lines 68-70. If the function performs this validation, it shouldn't be listed as a caller requirement in the safety documentation. Either remove the validation (making it truly unsafe) or update the documentation to only list the actual unsafe requirements (valid GPU pointer and buffer size).

Comment on lines +88 to +95
/// Returns number of bytes read on success, negative on error
pub fn cuFileRead(
    handle: CUfileHandle_t,
    buf: *mut c_void,
    size: c_size_t,
    file_offset: c_longlong,
    dev_offset: c_longlong,
) -> isize;

Copilot AI Dec 12, 2025


The documentation comment should clarify the behavior when a negative value is returned. Based on the usage in storage.rs, negative return values indicate errors, but the comment only mentions "number of bytes read on success, negative on error" without specifying what the negative value represents (error code, -errno, etc.).
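One way to make that convention explicit at the call site is to fold the raw return value into a `Result`. This is a sketch; it assumes (as the comment suggests, and as should be verified against the cuFile documentation) that a negative return value is a negated error code:

```rust
// Sketch: map a cuFile-style return value (bytes read on success,
// negative code on failure) into a Result. The "negated error code"
// interpretation is an assumption to be checked against cuFile docs.
fn map_cufile_ret(ret: isize) -> Result<usize, i32> {
    if ret < 0 {
        Err(-ret as i32)
    } else {
        Ok(ret as usize)
    }
}

fn main() {
    assert_eq!(map_cufile_ret(4096), Ok(4096));
    assert_eq!(map_cufile_ret(-22), Err(22)); // e.g. EINVAL
    println!("ok");
}
```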

Comment on lines +896 to +906
let shape_vec = info.shape.to_vec();
let shape: PyObject = shape_vec.clone().into_pyobject(py)?.into();

// Create tensor with correct dtype and device
let dtype_clone = dtype.clone_ref(py);
let kwargs = [
    (intern!(py, "dtype"), dtype_clone),
    (intern!(py, "device"), device),
].into_py_dict(py)?;

let shape_for_call: PyObject = shape_vec.into_pyobject(py)?.into();

Copilot AI Dec 12, 2025


The shape is being converted to PyObject twice: once at line 897 and again at line 906 (shape_for_call). This creates unnecessary duplication and the first conversion stored in 'shape' is only used later at line 925. Consider removing the first conversion and reusing shape_for_call, or clarifying why both are needed.

Comment on lines +924 to +925
// Reshape if needed
let tensor = tensor.getattr(intern!(py, "reshape"))?.call1((shape,))?;

Copilot AI Dec 12, 2025


Calling reshape after the GDS read may be unnecessary and could cause issues. The tensor is already created with the correct shape at line 907 using torch.empty(shape). The reshape call could potentially create a view or copy, which defeats the purpose of zero-copy GDS loading. Verify if reshape is actually needed here, or if the tensor already has the correct shape from creation.

Suggested change
// Reshape if needed
let tensor = tensor.getattr(intern!(py, "reshape"))?.call1((shape,))?;
// Tensor already has correct shape; no reshape needed

for _ in range(iterations):
    start = time.perf_counter()
    with safe_open(filename, framework="pt", device="cuda:0", use_gds=False) as f:
        tensor = f.get_tensor(tensor_name)

Copilot AI Dec 12, 2025


Variable tensor is not used.

for _ in range(iterations):
    start = time.perf_counter()
    with safe_open(filename, framework="pt", device="cuda:0", use_gds=True) as f:
        tensor = f.get_tensor(tensor_name)

Copilot AI Dec 12, 2025


Variable tensor is not used.

# Warmup
for _ in range(warmup):
    with safe_open(filename, framework="pt", device="cuda:0", use_gds=False) as f:
        tensor = f.get_tensor(tensor_name)

Copilot AI Dec 12, 2025


This assignment to 'tensor' is unnecessary as it is redefined before this value is used.

Suggested change
tensor = f.get_tensor(tensor_name)
f.get_tensor(tensor_name)

# Warmup
for _ in range(warmup):
    with safe_open(filename, framework="pt", device="cuda:0", use_gds=True) as f:
        tensor = f.get_tensor(tensor_name)

Copilot AI Dec 12, 2025


This assignment to 'tensor' is unnecessary as it is redefined before this value is used.

Suggested change
tensor = f.get_tensor(tensor_name)
f.get_tensor(tensor_name)

import os
import tempfile
import time
from typing import Dict, List, Tuple

Copilot AI Dec 12, 2025


Import of 'List' is not used.

Suggested change
from typing import Dict, List, Tuple
from typing import Dict, Tuple

@maifeeulasad
Author

The benchmarking and testing notebook can be found here, just run all: https://gist.github.com/maifeeulasad/741df3cf99a9970ce7a4897d5583b223

Kaggle uses a loopback FS, tmpfs, overlay FS, or a container-mounted ephemeral disk. But GDS requires one of the following:

  • XFS and EXT4 filesystem in ordered mode on NVMe/NVMeOF/ScaleFlux CSD devices.
  • NFS over RDMA with MOFED 5.1 and above
  • RDMA capable distributed filesystems like DDN Exascaler, WekaFS, and VAST.
  • ScaleFlux Computational storage

So I wasn't able to test in Kaggle. I haven't tested on Colab yet, but I think the situation would be pretty much the same. And most data centers still use HDDs.
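For anyone trying to reproduce this, a rough, non-authoritative way to check whether libcufile is even present, before worrying about filesystem support, might look like this (the paths below are common CUDA install locations, not an exhaustive or official list):

```python
import ctypes.util
import os

def libcufile_present():
    """Heuristic only: checks whether libcufile can be found at all.

    Finding the library does NOT mean the filesystem supports GDS;
    that still depends on the storage stack (XFS/EXT4 on NVMe, etc.).
    """
    if ctypes.util.find_library("cufile"):
        return True
    return any(
        os.path.exists(p)
        for p in (
            "/usr/local/cuda/lib64/libcufile.so",
            "/usr/local/cuda/targets/x86_64-linux/lib/libcufile.so",
        )
    )
```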

Based on what I have implemented and tested so far:

  • GDS performs really well when the tensor size is big
  • It also performs really well when we block the RAM with another script
  • It also performs really well if we put a lot of load on the CPU

So I see usage for GDS in real-life scenarios, where loads are quite noisy, especially when we are dealing with lots of models and applications.

I will try to share more information soon. Open for discussion.

ref:

@frauttauteffasu

This adds cufile as a hard dependency, failing to build on any platform without it:

         Compiling safetensors v0.7.0-dev.0 (/tmp/safetensors/safetensors)
      warning: unused import: `driver::GdsDriver`
        --> src/gds/mod.rs:18:9
         |
      18 | pub use driver::GdsDriver;
         |         ^^^^^^^^^^^^^^^^^
         |
         = note: `#[warn(unused_imports)]` on by default

      warning: unused import: `error::GdsError`
        --> src/gds/mod.rs:20:9
         |
      20 | pub use error::GdsError;
         |         ^^^^^^^^^^^^^^^

      warning: unused import: `handle::GdsHandle`
        --> src/gds/mod.rs:22:9
         |
      22 | pub use handle::GdsHandle;
         |         ^^^^^^^^^^^^^^^^^

      warning: unused variable: `result`
        --> src/gds/handle.rs:78:21
         |
      78 |                 let result = cuFileHandleDeregister(self.handle);
         |                     ^^^^^^ help: if this is intentional, prefix it
      with an underscore: `_result`
         |
         = note: `#[warn(unused_variables)]` on by default

      warning: method `is_initialized` is never used
        --> src/gds/driver.rs:64:12
         |
      27 | impl GdsDriver {
         | -------------- method in this implementation
      ...
      64 |     pub fn is_initialized(&self) -> bool {
         |            ^^^^^^^^^^^^^^
         |
         = note: `#[warn(dead_code)]` on by default

      warning: method `fd` is never used
        --> src/gds/handle.rs:62:12
         |
      21 | impl GdsHandle {
         | -------------- method in this implementation
      ...
      62 |     pub fn fd(&self) -> i32 {
         |            ^^

      warning: field `path` is never read
        --> src/gds/storage.rs:11:5
         |
      9  | pub struct GdsStorage {
         |            ---------- field in this struct
      10 |     handle: GdsHandle,
      11 |     path: PathBuf,
         |     ^^^^

      warning: methods `path` and `size` are never used
        --> src/gds/storage.rs:36:12
         |
      15 | impl GdsStorage {
         | --------------- methods in this implementation
      ...
      36 |     pub fn path(&self) -> &PathBuf {
         |            ^^^^
      ...
      41 |     pub fn size(&self) -> usize {
         |            ^^^^

      warning: creating a shared reference to mutable static
        --> src/gds/driver.rs:44:13
         |
      44 | /             DRIVER_INSTANCE
      45 | |                 .clone()
         | |________________________^ shared reference to mutable static
         |
         = note: for more information, see
      <https://doc.rust-lang.org/nightly/edition-guide/rust-2024/static-mut-references.html>
         = note: shared references to mutable statics are dangerous; it's
      undefined behavior if the static is mutated or if a mutable reference is
      created for it while the shared reference lives
         = note: `#[warn(static_mut_refs)]` on by default

      error: linking with `cc` failed: exit status: 1
        |
        = note:  "cc" "-Wl,--version-script=/tmp/rustcO36qMS/list"
      "-Wl,--no-undefined-version" "-m64" "/tmp/rustcO36qMS/symbols.o"
      "<17 object files omitted>" "-Wl,--as-needed"
      "-Wl,-Bdynamic" "-lcufile" "-Wl,-Bstatic"
      "/tmp/safetensors/bindings/python/target/release/deps/{libsafetensors-6d7dcccfb74f4666.rlib,libserde_json-f8b6f585bf5df12b.rlib,libmemchr-792e9d77b6c57362.rlib,libitoa-eb11a3c454c526bf.rlib,libryu-e987ae88e9011779.rlib,libserde-737e8e7c0e87e24c.rlib,libserde_core-31036d4ca4b1a638.rlib,libpyo3-656dbc47a26c5839.rlib,libmemoffset-7d7bab1c5fa168b5.rlib,libonce_cell-1a293e670803d6dd.rlib,libpyo3_ffi-116dfab57aef71bd.rlib,libunindent-4f30b6354b8a919b.rlib,libmemmap2-846ee8a68c9859d8.rlib,liblibc-994a2b65212047a6.rlib}.rlib"
      "<sysroot>/lib/rustlib/x86_64-unknown-linux-gnu/lib/{libstd-*,libpanic_unwind-*,libobject-*,libmemchr-*,libaddr2line-*,libgimli-*,librustc_demangle-*,libstd_detect-*,libhashbrown-*,librustc_std_workspace_alloc-*,libminiz_oxide-*,libadler2-*,libunwind-*,libcfg_if-*,liblibc-*,librustc_std_workspace_core-*,liballoc-*,libcore-*,libcompiler_builtins-*}.rlib"
      "-Wl,-Bdynamic" "-lgcc_s" "-lutil" "-lrt" "-lpthread" "-lm" "-ldl"
      "-lc" "-L" "/tmp/rustcO36qMS/raw-dylibs" "-Wl,--eh-frame-hdr"
      "-Wl,-z,noexecstack" "-L" "<sysroot>/local/cuda/lib64" "-L"
      "<sysroot>/local/cuda-12/lib64" "-L" "<sysroot>/local/cuda-13.0/lib64"
      "-L" "<sysroot>/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-o"
      "/tmp/safetensors/bindings/python/target/release/deps/libsafetensors_rust.so"
      "-Wl,--gc-sections" "-shared" "-Wl,-z,relro,-z,now" "-Wl,-O1"
      "-Wl,--strip-debug" "-nodefaultlibs"
        = note: some arguments are omitted. use `--verbose` to show all linker
      arguments
        = note: /usr/bin/ld: cannot find -lcufile: No such file or directory
                collect2: error: ld returned 1 exit status
      

      warning: `safetensors-python` (lib) generated 9 warnings
      error: could not compile `safetensors-python` (lib) due to 1 previous
      error; 9 warnings emitted
      💥 maturin failed

@maifeeulasad
Author

Excellent edge case; I have now added a Cargo feature as a guard rail to prevent leaking GDS code into regular builds. Please check. So the GDS feature needs to be built explicitly, which is even better, as per my understanding.

The updated notebook can be found at the previous URL; here is the URL again for your convenience: https://gist.github.com/maifeeulasad/741df3cf99a9970ce7a4897d5583b223#file-safetensors-benchmarking-withand-without-gds-ipynb

Open for further review, discussion, exploration.
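The guard-rail idea above can be sketched like this (this is the general shape of Cargo feature gating, not the PR's exact code; the module and function names are illustrative). With the `cuda-gds` feature off, none of the GDS symbols are compiled, so nothing links against libcufile:

```rust
// With a plain `cargo build` (feature off), only the stub module is
// compiled and libcufile is never referenced;
// `cargo build --features cuda-gds` would swap in the real implementation.
#[cfg(feature = "cuda-gds")]
mod gds {
    pub fn available() -> bool { true }
}

#[cfg(not(feature = "cuda-gds"))]
mod gds {
    pub fn available() -> bool { false }
}

fn main() {
    println!("gds compiled in: {}", gds::available());
}
```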

@maifeeulasad maifeeulasad requested a review from Narsil December 12, 2025 17:25
@frauttauteffasu

Please explain why cupy is required. What limitations are there in the PyTorch API that make it impossible to implement this without adding a dependency on cupy?

Comment on lines +12 to +14
[features]
default = []
cuda-gds = []

@frauttauteffasu frauttauteffasu Dec 12, 2025


As this is not exposed in the Python binding, it forces the user to manually invoke maturin, which contradicts the README.

@phaserblast

I don't think I understand: is the GDS reader implemented in Python? What if I want to use safetensors from a Rust or C project? Shouldn't the reader be implemented in native Rust and only have bindings for Python?

@frauttauteffasu

The GDS reader is implemented in Rust with Python bindings. However, it currently requires calling maturin to build the Rust GDS code and then installing the wheel maturin built, as there is no Python option to trigger that. So the README instructions do not work. It is also unclear how this would work with a prebuilt package hosted on PyPI.

$ cd safetensors/bindings/python/
$ maturin build
$ python -m pip install target/wheels/*.whl


Successfully merging this pull request may close these issues.

Any plan to support Nvidia GPUDirect Storage?
