nvidia gds support #676
Conversation
- Try to load with GDS when not explicitly specified
- Add a flag to control loading with GDS
- Load in dynamic chunks for GDS (a sketch follows below)
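For illustration, the chunked read could look roughly like the sketch below. The `cuFileRead` signature matches the FFI binding quoted later in this thread; everything else (`handle`, `gpu_ptr`, `total_size`, `chunk_size`, `file_offset`, the `ReadFailed` variant) is an assumed name for illustration, not the PR's actual code.

```rust
// Sketch only (assumed names, not the PR's code): read a tensor's bytes
// from disk straight into GPU memory in chunks via cuFileRead.
let mut done: usize = 0;
while done < total_size {
    let n = chunk_size.min(total_size - done);
    // buf is the base device pointer; dev_offset advances into it so each
    // chunk lands after the previous one.
    let ret = unsafe {
        cuFileRead(handle, gpu_ptr, n, (file_offset + done) as i64, done as i64)
    };
    if ret < 0 {
        return Err(GdsError::ReadFailed(ret)); // assumed error variant
    }
    done += ret as usize; // cuFileRead may return a short read; the loop handles it
}
```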
This could also be related to: #528. Adding this would be great support for mid-range GPUs: just slap in an SSD and you are good to go. Definitely not great, but it works.
```python
# GDS (GPU Direct Storage) support
try:
    import cupy as cp
```
That's yet another library; we can't have another dependency here. Even if it were a core dependency of torch, we wouldn't access it directly.
Understood the architecture; I moved the GDS-related code into Rust, with FFI calls via libc.
Narsil left a comment:
The whole code looks honestly pretty bad (looks like AI slop to me tbh, sorry if you handcrafted this).
- We need benchmarks to showcase how it differs. Given GPUDirect support, I think being VERY precise about the environment the thing was tested in is very important. In previous tests I have made, GDS was not really bringing much to the table. Hardware/cloud vendors usually make it hard to use those features efficiently.
- Remove the public surface. GDS, just like memory mapping, should be an implementation detail, not something users should care about. If it's not OBVIOUSLY faster, then we shouldn't use it.
- The whole thing still goes through the CPU, which negates the entire benefit of using GDS.
@johnnynunez would anyone from NVIDIA be willing to help on this? It would be particularly useful for the Spark with its current mmap issue, but I think it has performance benefits beyond that use case, and safetensors is more widely adopted than fastsafetensors, which already supports GDS but does not work with distributed vLLM.
Was the provided benchmark, which shows GDS as always slower, run on the code as of bc580db? It would also make sense to provide the GPU used, the versions of the NVIDIA components involved, the benchmark code, the command used, and where to obtain the .safetensors file used. Have you benchmarked with and without cupy?
Tested:
- multiple loads
- shape and data preservation
- failing cleanly without a CUDA GPU (CPU and other GPUs)
- different dtypes

Benchmark observations:
- bigger sizes perform better
- GDS helps when the CPU is quite busy
Pull request overview
This PR adds GPU Direct Storage (GDS) support for PyTorch in safetensors, enabling zero-copy data transfer directly from NVMe storage to GPU memory, bypassing the CPU. This feature is currently Linux-only and requires NVIDIA's cuFile library (libcufile.so).
Key Changes:
- Rust FFI bindings to NVIDIA's cuFile library for GDS operations
- New storage backend (`CudaGds`) that performs direct NVMe-to-GPU transfers
- Python API extension with a `use_gds` parameter for `safe_open` and `_safe_open_handle`
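For readers unfamiliar with cuFile, the call sequence such a backend wraps looks roughly like the sketch below. The real bindings live in bindings/python/src/gds/bindings.rs; the types here are simplified stand-ins (in NVIDIA's actual API, cuFileDriverOpen, cuFileHandleRegister, and cuFileDriverClose return a CUfileError_t struct, and registration takes a CUfileDescr_t).

```rust
use std::os::raw::c_void;

// Simplified, illustrative cuFile FFI sketch; not the PR's bindings.
type CUfileHandle_t = *mut c_void;

extern "C" {
    fn cuFileDriverOpen() -> i32;                 // real API: returns CUfileError_t
    fn cuFileHandleRegister(fh: *mut CUfileHandle_t, descr: *mut c_void) -> i32;
    fn cuFileRead(
        fh: CUfileHandle_t,
        buf: *mut c_void,  // base GPU device pointer (not host memory)
        size: usize,
        file_offset: i64,
        dev_offset: i64,   // offset into the GPU buffer
    ) -> isize;
    fn cuFileHandleDeregister(fh: CUfileHandle_t);
    fn cuFileDriverClose() -> i32;                // real API: returns CUfileError_t
}

// Sequence: cuFileDriverOpen (once per process) -> cuFileHandleRegister
// (per file) -> cuFileRead (DMA from NVMe into GPU memory, no CPU bounce
// buffer) -> cuFileHandleDeregister -> cuFileDriverClose.
```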
Reviewed changes
Copilot reviewed 10 out of 13 changed files in this pull request and generated 19 comments.
Show a summary per file
| File | Description |
|---|---|
| bindings/python/tests/test_gds.py | Comprehensive test suite for GDS functionality with various tensor sizes and data types |
| bindings/python/src/lib.rs | Integration of GDS storage backend with validation and tensor loading logic |
| bindings/python/src/gds/storage.rs | GDS storage implementation with read-to-device functionality |
| bindings/python/src/gds/mod.rs | Module organization for GDS components |
| bindings/python/src/gds/handle.rs | cuFile handle management with RAII semantics |
| bindings/python/src/gds/error.rs | Error types for GDS operations |
| bindings/python/src/gds/driver.rs | cuFile driver lifecycle management with singleton pattern |
| bindings/python/src/gds/bindings.rs | Low-level FFI bindings to libcufile.so |
| bindings/python/py_src/safetensors/torch.py | Whitespace-only formatting fix |
| bindings/python/py_src/safetensors/__init__.py | Whitespace-only formatting fix |
| bindings/python/benches/test_gds.py | Performance benchmarking suite comparing GDS vs standard loading |
| bindings/python/Cargo.toml | Added libc dependency for FFI types |
| .gitignore | Added cufile.log to ignored files |
Comments suppressed due to low confidence (1)
bindings/python/tests/test_gds.py:5
- Import of 'np' is not used.
import numpy as np
```rust
if gpu_ptr.is_null() {
    return Err(GdsError::InvalidFileDescriptor);
}
```
Copilot AI · Dec 12, 2025
Returning InvalidFileDescriptor error when gpu_ptr is null is misleading. A null GPU pointer is not related to a file descriptor issue. Consider using a more appropriate error variant like InvalidPointer or adding a new NullGpuPointer error variant to better describe the actual problem.
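A sketch of that fix; the enum shape and the `check_gpu_ptr` helper are assumptions for illustration, not the PR's code:

```rust
#[derive(Debug)]
pub enum GdsError {
    InvalidFileDescriptor,
    /// New variant: the caller passed a null device pointer.
    NullGpuPointer,
}

// Hypothetical helper showing the corrected check.
fn check_gpu_ptr(gpu_ptr: *mut std::ffi::c_void) -> Result<(), GdsError> {
    if gpu_ptr.is_null() {
        // Previously Err(GdsError::InvalidFileDescriptor), which was misleading.
        return Err(GdsError::NullGpuPointer);
    }
    Ok(())
}
```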
```rust
pub unsafe fn read_to_device(
    &self,
    gpu_ptr: *mut std::ffi::c_void,
    size: usize,
    file_offset: usize,
) -> Result<usize, GdsError> {
```
Copilot AI · Dec 12, 2025
The safety documentation mentions that "The file has at least file_offset + size bytes" as a caller requirement, but this is actually validated by the function at lines 68-70. If the function performs this validation, it shouldn't be listed as a caller requirement in the safety documentation. Either remove the validation (making it truly unsafe) or update the documentation to only list the actual unsafe requirements (valid GPU pointer and buffer size).
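One way the doc comment could be rewritten along those lines (the wording is a sketch, not from the PR):

```rust
/// Reads `size` bytes starting at `file_offset` directly into GPU memory.
/// The file length is validated internally, so it is not a caller obligation.
///
/// # Safety
///
/// - `gpu_ptr` must be a valid CUDA device pointer.
/// - The allocation behind `gpu_ptr` must be at least `size` bytes.
pub unsafe fn read_to_device(
    &self,
    gpu_ptr: *mut std::ffi::c_void,
    size: usize,
    file_offset: usize,
) -> Result<usize, GdsError> {
    // body unchanged
```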
```rust
/// Returns number of bytes read on success, negative on error
pub fn cuFileRead(
    handle: CUfileHandle_t,
    buf: *mut c_void,
    size: c_size_t,
    file_offset: c_longlong,
    dev_offset: c_longlong,
) -> isize;
```
Copilot AI · Dec 12, 2025
The documentation comment should clarify the behavior when a negative value is returned. Based on the usage in storage.rs, negative return values indicate errors, but the comment only mentions "number of bytes read on success, negative on error" without specifying what the negative value represents (error code, -errno, etc.).
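As I read NVIDIA's cuFile docs, a negative return is either -1 with errno set (for POSIX-style I/O errors) or a negative CUfileOpError value (for cuFile-specific errors), so the comment could spell that out:

```rust
/// Returns the number of bytes read on success. On failure, returns -1 with
/// `errno` set for POSIX-style I/O errors, or a negative `CUfileOpError`
/// value for cuFile-specific errors.
pub fn cuFileRead(
    handle: CUfileHandle_t,
    buf: *mut c_void,
    size: c_size_t,
    file_offset: c_longlong,
    dev_offset: c_longlong,
) -> isize;
```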
```rust
let shape_vec = info.shape.to_vec();
let shape: PyObject = shape_vec.clone().into_pyobject(py)?.into();

// Create tensor with correct dtype and device
let dtype_clone = dtype.clone_ref(py);
let kwargs = [
    (intern!(py, "dtype"), dtype_clone),
    (intern!(py, "device"), device),
].into_py_dict(py)?;

let shape_for_call: PyObject = shape_vec.into_pyobject(py)?.into();
```
Copilot AI · Dec 12, 2025
The shape is being converted to PyObject twice: once at line 897 and again at line 906 (shape_for_call). This creates unnecessary duplication and the first conversion stored in 'shape' is only used later at line 925. Consider removing the first conversion and reusing shape_for_call, or clarifying why both are needed.
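A sketch of the deduplication, reusing PyO3's `clone_ref` (which the surrounding code already uses for `dtype`):

```rust
// Convert once; clone_ref only bumps the Python refcount, it does not
// rebuild the shape list.
let shape: PyObject = info.shape.to_vec().into_pyobject(py)?.into();
let shape_for_call: PyObject = shape.clone_ref(py);
```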
```rust
// Reshape if needed
let tensor = tensor.getattr(intern!(py, "reshape"))?.call1((shape,))?;
```
Copilot AI · Dec 12, 2025
Calling reshape after the GDS read may be unnecessary and could cause issues. The tensor is already created with the correct shape at line 907 using torch.empty(shape). The reshape call could potentially create a view or copy, which defeats the purpose of zero-copy GDS loading. Verify if reshape is actually needed here, or if the tensor already has the correct shape from creation.
Suggested change:
```diff
- // Reshape if needed
- let tensor = tensor.getattr(intern!(py, "reshape"))?.call1((shape,))?;
+ // Tensor already has correct shape; no reshape needed
```
```python
for _ in range(iterations):
    start = time.perf_counter()
    with safe_open(filename, framework="pt", device="cuda:0", use_gds=False) as f:
        tensor = f.get_tensor(tensor_name)
```
Copilot AI · Dec 12, 2025
Variable tensor is not used.
```python
for _ in range(iterations):
    start = time.perf_counter()
    with safe_open(filename, framework="pt", device="cuda:0", use_gds=True) as f:
        tensor = f.get_tensor(tensor_name)
```
Copilot AI · Dec 12, 2025
Variable tensor is not used.
```python
# Warmup
for _ in range(warmup):
    with safe_open(filename, framework="pt", device="cuda:0", use_gds=False) as f:
        tensor = f.get_tensor(tensor_name)
```
Copilot AI · Dec 12, 2025
This assignment to 'tensor' is unnecessary as it is redefined before this value is used.
Suggested change:
```diff
- tensor = f.get_tensor(tensor_name)
+ f.get_tensor(tensor_name)
```
```python
# Warmup
for _ in range(warmup):
    with safe_open(filename, framework="pt", device="cuda:0", use_gds=True) as f:
        tensor = f.get_tensor(tensor_name)
```
Copilot AI · Dec 12, 2025
This assignment to 'tensor' is unnecessary as it is redefined before this value is used.
Suggested change:
```diff
- tensor = f.get_tensor(tensor_name)
+ f.get_tensor(tensor_name)
```
```python
import os
import tempfile
import time
from typing import Dict, List, Tuple
```
Copilot AI · Dec 12, 2025
Import of 'List' is not used.
Suggested change:
```diff
- from typing import Dict, List, Tuple
+ from typing import Dict, Tuple
```
The benchmarking and testing notebook can be found here (just run all): https://gist.github.com/maifeeulasad/741df3cf99a9970ce7a4897d5583b223
Kaggle uses a loopback FS, tmpfs, overlay FS, or container-mounted ephemeral disk, but GDS requires the following:
So I wasn't able to test on Kaggle. I haven't tested on Colab yet, but I think the situation would be much the same. And most data centers still use HDDs. Based on what I have implemented and tested so far,
I see a use for GDS in real-life scenarios where loads are quite noisy, especially when we are dealing with lots of models and applications. I will try to share more information soon. Open for discussion. ref:
This adds cufile as a hard dependency, failing to build on any platform without it:
Excellent edge case! I have now added a Cargo feature as a guard rail to prevent leaking GDS code into the regular build; please check. GDS support now has to be built explicitly, which is even better, as per my understanding. The updated notebook can be found at the previous URL; here it is again for your convenience: https://gist.github.com/maifeeulasad/741df3cf99a9970ce7a4897d5583b223#file-safetensors-benchmarking-withand-without-gds-ipynb Open for further review, discussion, and exploration.
Please explain why cupy is required. What limitations are there in the PyTorch API that make it impossible to implement this without adding a dependency on cupy?
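For context on why cupy may be avoidable: torch tensors expose their raw device address via `Tensor.data_ptr()`, which can be fetched through PyO3. A sketch (the `tensor` and `py` bindings are assumed to exist in scope):

```rust
use pyo3::intern;

// `tensor` wraps a torch.Tensor already allocated on a CUDA device;
// data_ptr() returns its raw device address as a Python int.
let addr: usize = tensor.call_method0(intern!(py, "data_ptr"))?.extract()?;
let gpu_ptr = addr as *mut std::ffi::c_void;
// gpu_ptr can now be handed to cuFileRead; no cupy involved.
```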
```toml
[features]
default = []
cuda-gds = []
```
As this is not exposed in the Python binding, it forces the user to manually invoke maturin, which contradicts the README.
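The guard rail presumably works via cfg attributes along these lines (a sketch; only the module name is taken from the file list above, and `gds_available`/`driver_is_ready` are hypothetical):

```rust
// In bindings/python/src/lib.rs (sketch): compile the GDS module only when
// the `cuda-gds` feature is enabled, so default builds never touch
// libcufile. Built explicitly with, e.g., `maturin develop --features cuda-gds`.
#[cfg(feature = "cuda-gds")]
mod gds;

// Call sites are gated the same way; without the feature the standard
// loader is used (hypothetical shape, not the PR's exact wiring):
#[cfg(feature = "cuda-gds")]
fn gds_available() -> bool {
    gds::driver_is_ready()
}

#[cfg(not(feature = "cuda-gds"))]
fn gds_available() -> bool {
    false
}
```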
I don't think I understand: is the GDS reader implemented in Python? What if I want to use safetensors from a Rust or C project? Shouldn't the reader be implemented in native Rust, with only bindings for Python?
The GDS reader is implemented in Rust with Python bindings. It does, however, currently require calling

What does this PR do?
Fixes #299
This PR adds support for GDS (GPU Direct Storage) for PyTorch for now. If I get positive feedback, I am definitely willing to integrate it into other modules as well.
Open to making changes based on review.