GPU Direct Storage (GDS) enables direct DMA transfers between GPU memory and NVMe/NVMe-oF storage, bypassing the CPU staging buffer entirely. This can significantly improve throughput for KV-cache offloading.
When GDS is enabled, all GPU↔storage transfers go through the cuFile library. cuFile behavior (I/O threads, bounce buffer sizes, RDMA settings, filesystem-specific tuning) can be customized via a `cufile_rdma.json` config file; see Tuning cuFile for an example.
- NVIDIA GPU with GDS support
- cuFile library installed at runtime (`nvidia-fs` kernel module + `libcufile`)
- Filesystem that supports O_DIRECT (e.g., ext4, xfs, NVMe-oF; not most NFS/FUSE mounts). GDS will still work on unsupported filesystems but will not give the expected benefit of bypassing the CPU.
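Whether a mount accepts O_DIRECT can be probed before enabling GDS. A minimal sketch (the helper name `supports_o_direct` is illustrative, not part of the connector; Linux-only, since `os.O_DIRECT` is not available on all platforms):

```python
import os
import tempfile


def supports_o_direct(directory: str) -> bool:
    """Probe whether the filesystem behind `directory` accepts O_DIRECT opens.

    A failed open here suggests the mount (e.g. tmpfs, many NFS/FUSE
    mounts) will not deliver the CPU-bypass benefit GDS relies on.
    """
    path = os.path.join(directory, ".o_direct_probe")
    try:
        fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_DIRECT, 0o600)
        os.close(fd)
        return True
    except OSError:
        return False
    finally:
        try:
            os.remove(path)
        except OSError:
            pass  # open itself may have failed; nothing to clean up
```

For example, `supports_o_direct("/mnt/nvme")` would typically return `True` on an ext4- or xfs-backed NVMe mount and `False` on tmpfs.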
GDS is disabled by default. The connector works without GDS using CPU staging buffers. Only enable GDS if your hardware and filesystem support it.
The wheel has no build-time dependency on cuFile. GDS support is loaded at runtime via `dlopen("libcufile.so")`; the same compiled wheel works on systems with or without GDS.
```
pip install -e .
```

To enable GDS at runtime, install cuFile on the target machine:

```
apt-get install -y cuda-cufile-12-9   # adjust version to match your CUDA install
```
If `libcufile.so` is not present at runtime, the connector falls back to CPU staging automatically.
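The runtime-fallback behavior described above can be sketched as follows (the function name and return values are illustrative, not the connector's actual API):

```python
import ctypes


def pick_io_backend(lib_name: str = "libcufile.so") -> str:
    """Try to load libcufile at runtime; fall back to CPU staging if absent.

    ctypes.CDLL calls dlopen() under the hood, mirroring the
    dlopen("libcufile.so") strategy described above.
    """
    try:
        ctypes.CDLL(lib_name)
        return "gds"
    except OSError:
        return "cpu_staging"
```

Because loading is deferred to runtime, a machine without cuFile simply takes the `cpu_staging` path instead of failing at import or install time.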
| `backend` value | Read | Write |
|---|---|---|
| `POSIX` (default) | CPU staging | CPU staging |
| `GDS_READ` | GDS direct | CPU staging |
| `GDS_WRITE` | CPU staging | GDS direct |
| `GDS` | GDS direct | GDS direct |
| `GDS_BB_READ` | GDS + Bounce Buffer | CPU staging |
| `GDS_BB_WRITE` | CPU staging | GDS + Bounce Buffer |
| `GDS_BB` | GDS + Bounce Buffer | GDS + Bounce Buffer |
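The table above can be expressed as a lookup from backend value to (read mode, write mode); the mode labels below are descriptive, not identifiers from the connector's code:

```python
# (read_mode, write_mode) per backend value, per the table above.
BACKEND_MODES = {
    "POSIX":        ("cpu_staging", "cpu_staging"),
    "GDS_READ":     ("gds_direct", "cpu_staging"),
    "GDS_WRITE":    ("cpu_staging", "gds_direct"),
    "GDS":          ("gds_direct", "gds_direct"),
    "GDS_BB_READ":  ("gds_bounce_buffer", "cpu_staging"),
    "GDS_BB_WRITE": ("cpu_staging", "gds_bounce_buffer"),
    "GDS_BB":       ("gds_bounce_buffer", "gds_bounce_buffer"),
}


def modes_for(backend: str) -> tuple:
    """Return the (read_mode, write_mode) pair for a backend value."""
    return BACKEND_MODES[backend]
```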
Bounce Buffer (BB) modes use GDS with an intermediate RDMA-registered GPU buffer instead of registering each KV cache block directly. This is useful when the number of GPU KV cache blocks is large and per-block registration would exceed GDS registration table limits. In direct mode, cuFile performs DMA directly to/from registered GPU buffers. In BB mode, cuFile routes data through a single pre-registered bounce buffer, adding one extra GPU-to-GPU copy but avoiding per-block registration overhead.
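The direct-vs-BB trade-off above suggests a simple selection rule; this is a hypothetical heuristic, and the registration limit used here is an illustrative placeholder, not a documented cuFile constant:

```python
def choose_gds_backend(num_kv_blocks: int, max_registrations: int = 16384) -> str:
    """Pick direct vs bounce-buffer GDS mode based on KV block count.

    Direct mode registers every KV cache block with cuFile, so when the
    block count would exceed the registration table limit, prefer BB mode:
    one pre-registered buffer, at the cost of one extra GPU-to-GPU copy.
    """
    return "GDS" if num_kv_blocks <= max_registrations else "GDS_BB"
```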
Add `backend` to `kv_connector_extra_config` in your vLLM config:

```
--kv-transfer-config '{
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "spec_name": "SharedStorageOffloadingSpec",
        "spec_module_path": "llmd_fs_backend.spec",
        "shared_storage_path": "/mnt/nvme/kv-cache/",
        "block_size": 256,
        "threads_per_gpu": "64",
        "backend": "GDS"
    }
}'
```

If you are unsure which backend to use, start with `GDS` for local NVMe disks or `GDS_BB` for shared storage.
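When the config is assembled in a launcher script, building the flag from a dict avoids JSON quoting mistakes. A sketch mirroring the example above (all values are illustrative):

```python
import json

kv_transfer_config = {
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "spec_name": "SharedStorageOffloadingSpec",
        "spec_module_path": "llmd_fs_backend.spec",
        "shared_storage_path": "/mnt/nvme/kv-cache/",
        "block_size": 256,
        "threads_per_gpu": "64",
        "backend": "GDS",
    },
}

# Serialize once; json.dumps guarantees valid JSON for the CLI flag.
flag = "--kv-transfer-config '" + json.dumps(kv_transfer_config) + "'"
```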
Set `STORAGE_LOG_LEVEL=info` (the default) and look for these lines at startup:

```
GdsFileIO: GPUDirect Storage (GDS) enabled
```

If GDS initialization fails, the connector automatically falls back to CPU staging and logs:

```
StorageOffloadEngine: GDS initialization failed, falling back to CPU_BUFFER_STAGE for both READ and WRITE
StorageOffloadEngine: READ=CPU WRITE=CPU
```
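A startup-log check can be automated by extracting the `READ=`/`WRITE=` pair from the line shown above; a minimal sketch (the helper name is illustrative):

```python
import re


def active_modes(log_text):
    """Return the (READ, WRITE) modes from a StorageOffloadEngine
    status line, or None if no such line is present."""
    m = re.search(r"READ=(\S+) WRITE=(\S+)", log_text)
    return (m.group(1), m.group(2)) if m else None
```

For example, feeding it the fallback log above yields `("CPU", "CPU")`, confirming GDS is not active.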
For more detail during startup, use `STORAGE_LOG_LEVEL=debug`:

```
STORAGE_LOG_LEVEL=debug vllm serve ...
```

cuFile reads its configuration from a JSON file. By default it looks in `/etc/cufile.json`.
To use a custom path:

```
export CUFILE_ENV_PATH_JSON=/path/to/cufile_rdma.json
```

An example config is provided at `docs/cufile_rdma.json`.
All settings are documented inline in that file. For RDMA storage, fill in your NIC IPs:

```
"rdma_dev_addr_list": [ "<RDMA_NIC_IP_1>", "<RDMA_NIC_IP_2>" ]
```

To find your RDMA NIC IPs:

```
ibdev2netdev   # maps InfiniBand devices to network interfaces
ip addr show   # note the IPs of those interfaces
```

Your local `cufile_rdma.json` (with deployment-specific IPs) should not be committed to version control.
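Since the file is deployment-specific, it can also be generated at deploy time. A sketch, assuming `rdma_dev_addr_list` lives under the `properties` key as in the stock `/etc/cufile.json` (the IPs below are placeholders; check `docs/cufile_rdma.json` for the full layout):

```python
import json
import os
import tempfile

# Minimal cuFile config carrying only the RDMA NIC list; a real config
# would start from docs/cufile_rdma.json and keep its other settings.
cfg = {"properties": {"rdma_dev_addr_list": ["192.168.0.10", "192.168.0.11"]}}

path = os.path.join(tempfile.gettempdir(), "cufile_rdma.json")
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)

# Point cuFile at the generated file instead of /etc/cufile.json.
os.environ["CUFILE_ENV_PATH_JSON"] = path
```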
- Check that the `nvidia-fs` kernel module is loaded: `lsmod | grep nvidia_fs`
- Install the cuFile package: `apt-get install -y cuda-cufile-12-9`

- GDS requires `cudaDevAttrGPUDirectRDMASupported = 1`
- Check the GPU model with `nvidia-smi --query-gpu=name --format=csv`; A100, H100, and most data-center GPUs support it
- Consumer GPUs (RTX series) typically do not

- NFS, FUSE-based mounts, and tmpfs do not support O_DIRECT
- Use `GDS_BB` (Bounce Buffer mode) as a fallback for these filesystems
- For full GDS performance, use a local NVMe or NVMe-oF mount

- The connector falls back automatically if GDS init fails
- Always check startup logs for `READ=GDS_DIRECT` / `READ=GDS_BOUNCE_BUFFER` / `READ=CPU` to confirm the active mode
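The `nvidia-fs` module check above can be scripted by scanning `/proc/modules` (the same source `lsmod` reads); a minimal sketch with an illustrative helper name:

```python
def nvidia_fs_loaded(proc_modules_text: str) -> bool:
    """Return True if the nvidia_fs kernel module appears in the given
    /proc/modules content (module name is the first field per line)."""
    return any(
        line.split()[0] == "nvidia_fs"
        for line in proc_modules_text.splitlines()
        if line.strip()
    )


# Usage on a live Linux system:
# with open("/proc/modules") as f:
#     print(nvidia_fs_loaded(f.read()))
```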