GPU Direct Storage (GDS) enables direct DMA transfers between GPU memory and NVMe/NVMe-oF storage, bypassing the CPU staging buffer entirely. This can significantly improve throughput for KV-cache offloading.
When GDS is enabled, all GPU↔storage transfers go through the cuFile library. cuFile behavior (I/O threads, bounce buffer sizes, RDMA settings, filesystem-specific tuning) can be customized via a `cufile_rdma.json` config file; see Tuning cuFile for an example.
- NVIDIA GPU with GDS support
- cuFile library installed at runtime (`nvidia-fs` kernel module + `libcufile`)
- Filesystem that supports O_DIRECT (e.g., ext4, xfs, NVMe-oF; not most NFS/FUSE mounts). GDS will still work on unsupported filesystems but will not give the expected benefit of bypassing the CPU.
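Whether a mount accepts O_DIRECT can be probed before enabling GDS. A minimal sketch (the helper name `supports_o_direct` is illustrative, not part of the connector; Linux-only, since `os.O_DIRECT` is not available on all platforms):

```python
import os
import tempfile


def supports_o_direct(directory: str) -> bool:
    """Probe whether the filesystem behind `directory` accepts O_DIRECT opens.

    A failed open here suggests the mount (e.g. tmpfs, many NFS/FUSE
    mounts) will not deliver the CPU-bypass benefit GDS relies on.
    """
    path = os.path.join(directory, ".o_direct_probe")
    try:
        fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_DIRECT, 0o600)
        os.close(fd)
        return True
    except OSError:
        return False
    finally:
        try:
            os.remove(path)
        except OSError:
            pass  # open itself may have failed; nothing to clean up
```

For example, `supports_o_direct("/mnt/nvme")` would typically return `True` on an ext4- or xfs-backed NVMe mount and `False` on tmpfs.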
GDS is disabled by default. The connector works without GDS using CPU staging buffers. Only enable GDS if your hardware and filesystem support it.
The wheel has no build-time dependency on cuFile. GDS support is loaded at runtime via `dlopen("libcufile.so")`; the same compiled wheel works on systems with or without GDS.
```
pip install -e .
```

To enable GDS at runtime, install cuFile on the target machine:

```
apt-get install -y cuda-cufile-12-9   # adjust version to match your CUDA install
```
If `libcufile.so` is not present at runtime, the connector falls back to CPU staging automatically.
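The runtime-fallback behavior described above can be sketched as follows (the function name and return values are illustrative, not the connector's actual API):

```python
import ctypes


def pick_io_backend(lib_name: str = "libcufile.so") -> str:
    """Try to load libcufile at runtime; fall back to CPU staging if absent.

    ctypes.CDLL calls dlopen() under the hood, mirroring the
    dlopen("libcufile.so") strategy described above.
    """
    try:
        ctypes.CDLL(lib_name)
        return "gds"
    except OSError:
        return "cpu_staging"
```

Because loading is deferred to runtime, a machine without cuFile simply takes the `cpu_staging` path instead of failing at import or install time.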
| `backend` value | Read | Write |
|---|---|---|
| `POSIX` (default) | CPU staging | CPU staging |
| `GDS_READ` | GDS direct | CPU staging |
| `GDS_WRITE` | CPU staging | GDS direct |
| `GDS` | GDS direct | GDS direct |
| `GDS_BB_READ` | GDS + Bounce Buffer | CPU staging |
| `GDS_BB_WRITE` | CPU staging | GDS + Bounce Buffer |
| `GDS_BB` | GDS + Bounce Buffer | GDS + Bounce Buffer |
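The table above can be expressed as a lookup from backend value to (read mode, write mode); the mode labels below are descriptive, not identifiers from the connector's code:

```python
# (read_mode, write_mode) per backend value, per the table above.
BACKEND_MODES = {
    "POSIX":        ("cpu_staging", "cpu_staging"),
    "GDS_READ":     ("gds_direct", "cpu_staging"),
    "GDS_WRITE":    ("cpu_staging", "gds_direct"),
    "GDS":          ("gds_direct", "gds_direct"),
    "GDS_BB_READ":  ("gds_bounce_buffer", "cpu_staging"),
    "GDS_BB_WRITE": ("cpu_staging", "gds_bounce_buffer"),
    "GDS_BB":       ("gds_bounce_buffer", "gds_bounce_buffer"),
}


def modes_for(backend: str) -> tuple:
    """Return the (read_mode, write_mode) pair for a backend value."""
    return BACKEND_MODES[backend]
```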
Bounce Buffer (BB) modes use GDS with an intermediate RDMA-registered GPU buffer instead of registering each KV cache block directly. This is useful when the number of GPU KV cache blocks is large and per-block registration would exceed GDS registration table limits. In direct mode, cuFile performs DMA directly to/from registered GPU buffers. In BB mode, cuFile routes data through a single pre-registered bounce buffer, adding one extra GPU-to-GPU copy but avoiding per-block registration overhead.
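The direct-vs-BB trade-off above suggests a simple selection rule; this is a hypothetical heuristic, and the registration limit used here is an illustrative placeholder, not a documented cuFile constant:

```python
def choose_gds_backend(num_kv_blocks: int, max_registrations: int = 16384) -> str:
    """Pick direct vs bounce-buffer GDS mode based on KV block count.

    Direct mode registers every KV cache block with cuFile, so when the
    block count would exceed the registration table limit, prefer BB mode:
    one pre-registered buffer, at the cost of one extra GPU-to-GPU copy.
    """
    return "GDS" if num_kv_blocks <= max_registrations else "GDS_BB"
```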
Add `backend` to `kv_connector_extra_config` in your vLLM config:

```
--kv-transfer-config '{
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "spec_name": "SharedStorageOffloadingSpec",
        "spec_module_path": "llmd_fs_backend.spec",
        "shared_storage_path": "/mnt/nvme/kv-cache/",
        "block_size": 256,
        "threads_per_gpu": "64",
        "backend": "GDS"
    }
}'
```

If you are unsure which backend to use, start with `GDS` for local NVMe disks or `GDS_BB` for shared storage.
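When the config is assembled in a launcher script, building the flag from a dict avoids JSON quoting mistakes. A sketch mirroring the example above (all values are illustrative):

```python
import json

kv_transfer_config = {
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "spec_name": "SharedStorageOffloadingSpec",
        "spec_module_path": "llmd_fs_backend.spec",
        "shared_storage_path": "/mnt/nvme/kv-cache/",
        "block_size": 256,
        "threads_per_gpu": "64",
        "backend": "GDS",
    },
}

# Serialize once; json.dumps guarantees valid JSON for the CLI flag.
flag = "--kv-transfer-config '" + json.dumps(kv_transfer_config) + "'"
```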
Set `STORAGE_LOG_LEVEL=info` (the default) and look for these lines at startup:

```
GdsFileIO: GPUDirect Storage (GDS) enabled
```

If GDS initialization fails, the connector automatically falls back to CPU staging and logs:

```
StorageOffloadEngine: GDS initialization failed, falling back to CPU_BUFFER_STAGE for both READ and WRITE
StorageOffloadEngine: READ=CPU WRITE=CPU
```
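A startup-log check can be automated by extracting the `READ=`/`WRITE=` pair from the line shown above; a minimal sketch (the helper name is illustrative):

```python
import re


def active_modes(log_text):
    """Return the (READ, WRITE) modes from a StorageOffloadEngine
    status line, or None if no such line is present."""
    m = re.search(r"READ=(\S+) WRITE=(\S+)", log_text)
    return (m.group(1), m.group(2)) if m else None
```

For example, feeding it the fallback log above yields `("CPU", "CPU")`, confirming GDS is not active.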
For more detail during startup, use `STORAGE_LOG_LEVEL=debug`:

```
STORAGE_LOG_LEVEL=debug vllm serve ...
```

cuFile reads its configuration from a JSON file. By default it looks in `/etc/cufile.json`.
To use a custom path:

```
export CUFILE_ENV_PATH_JSON=/path/to/cufile_rdma.json
```

An example config is provided at `docs/cufile_rdma.json`.
All settings are documented inline in that file. For RDMA storage, fill in your NIC IPs:

```
"rdma_dev_addr_list": [ "<RDMA_NIC_IP_1>", "<RDMA_NIC_IP_2>" ]
```

To find your RDMA NIC IPs:

```
ibdev2netdev   # maps InfiniBand devices to network interfaces
ip addr show   # note the IPs of those interfaces
```

Your local `cufile_rdma.json` (with deployment-specific IPs) should not be committed to version control.
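Since the file is deployment-specific, it can also be generated at deploy time. A sketch, assuming `rdma_dev_addr_list` lives under the `properties` key as in the stock `/etc/cufile.json` (the IPs below are placeholders; check `docs/cufile_rdma.json` for the full layout):

```python
import json
import os
import tempfile

# Minimal cuFile config carrying only the RDMA NIC list; a real config
# would start from docs/cufile_rdma.json and keep its other settings.
cfg = {"properties": {"rdma_dev_addr_list": ["192.168.0.10", "192.168.0.11"]}}

path = os.path.join(tempfile.gettempdir(), "cufile_rdma.json")
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)

# Point cuFile at the generated file instead of /etc/cufile.json.
os.environ["CUFILE_ENV_PATH_JSON"] = path
```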
- Check that the `nvidia-fs` kernel module is loaded: `lsmod | grep nvidia_fs`
- Install the cuFile package: `apt-get install -y cuda-cufile-12-9`

- GDS requires `cudaDevAttrGPUDirectRDMASupported = 1`
- Check the GPU model with `nvidia-smi --query-gpu=name --format=csv`; A100, H100, and most data-center GPUs support it
- Consumer GPUs (RTX series) typically do not

- NFS, FUSE-based mounts, and tmpfs do not support O_DIRECT
- Use `GDS_BB` (Bounce Buffer mode) as a fallback for these filesystems
- For full GDS performance, use a local NVMe or NVMe-oF mount

- The connector falls back automatically if GDS init fails
- Always check startup logs for `READ=GDS_DIRECT` / `READ=GDS_BOUNCE_BUFFER` / `READ=CPU` to confirm the active mode
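The `nvidia-fs` module check above can be scripted by scanning `/proc/modules` (the same source `lsmod` reads); a minimal sketch with an illustrative helper name:

```python
def nvidia_fs_loaded(proc_modules_text: str) -> bool:
    """Return True if the nvidia_fs kernel module appears in the given
    /proc/modules content (module name is the first field per line)."""
    return any(
        line.split()[0] == "nvidia_fs"
        for line in proc_modules_text.splitlines()
        if line.strip()
    )


# Usage on a live Linux system:
# with open("/proc/modules") as f:
#     print(nvidia_fs_loaded(f.read()))
```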