
GPUDirect Storage (GDS) Guide

GPUDirect Storage (GDS) enables direct DMA transfers between GPU memory and NVMe/NVMe-oF storage, bypassing the CPU staging buffer entirely. This can significantly improve throughput for KV-cache offloading.

All GPU↔storage transfers in GDS mode go through the cuFile library. cuFile behavior (I/O thread count, bounce buffer sizes, RDMA settings, filesystem-specific tuning) can be customized via a cufile_rdma.json config file; see Tuning cuFile below for an example.

Requirements

  • NVIDIA GPU with GDS support
  • cuFile library installed at runtime (nvidia-fs kernel module + libcufile)
  • Filesystem that supports O_DIRECT (e.g., ext4, xfs, NVMe-oF — not most NFS/FUSE mounts). GDS will still work on unsupported filesystems but will not give the expected benefit of bypassing the CPU.

Default: GDS is disabled by default. The connector works without GDS using CPU staging buffers. Only enable GDS if your hardware and filesystem support it.
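If you are unsure whether a mount supports O_DIRECT, you can probe it by trying to open a file with the flag. The helper below is a hypothetical check (not part of the connector) and returns False on filesystems such as tmpfs and most FUSE mounts:

```python
import os
import tempfile

def supports_o_direct(directory: str) -> bool:
    """Return True if files under `directory` can be opened with O_DIRECT.

    Hypothetical probe for illustration; Linux-only (os.O_DIRECT).
    """
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        probe = os.open(path, os.O_RDONLY | os.O_DIRECT)
        os.close(probe)
        return True
    except OSError:
        # EINVAL is the typical errno on filesystems without O_DIRECT
        return False
    finally:
        os.remove(path)
```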

Build

The wheel has no build-time dependency on cuFile. GDS support is loaded at runtime via dlopen("libcufile.so") — the same compiled wheel works on systems with or without GDS.

pip install -e .

To enable GDS at runtime, install cuFile on the target machine: apt-get install -y cuda-cufile-12-9 (adjust version to match your CUDA install). If libcufile.so is not present at runtime, the connector falls back to CPU staging automatically.
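The runtime probe can be reproduced in a few lines. The sketch below mirrors the connector's dlopen()-based fallback using Python's ctypes; the helper name is illustrative, and the real connector also initializes the cuFile driver before committing to GDS:

```python
import ctypes

def gds_available() -> bool:
    """Probe for libcufile at runtime (sketch of the dlopen fallback)."""
    try:
        ctypes.CDLL("libcufile.so")  # same lookup rules as dlopen()
        return True
    except OSError:
        return False

# Fall back to CPU staging (POSIX backend) when the library is absent.
backend = "GDS" if gds_available() else "POSIX"
```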

GDS modes

| backend value   | Read                | Write               |
| --------------- | ------------------- | ------------------- |
| POSIX (default) | CPU staging         | CPU staging         |
| GDS_READ        | GDS direct          | CPU staging         |
| GDS_WRITE       | CPU staging         | GDS direct          |
| GDS             | GDS direct          | GDS direct          |
| GDS_BB_READ     | GDS + Bounce Buffer | CPU staging         |
| GDS_BB_WRITE    | CPU staging         | GDS + Bounce Buffer |
| GDS_BB          | GDS + Bounce Buffer | GDS + Bounce Buffer |

Bounce Buffer (BB) modes use GDS with an intermediate RDMA-registered GPU buffer instead of registering each KV cache block directly. This is useful when the number of GPU KV cache blocks is large and per-block registration would exceed GDS registration table limits. In direct mode, cuFile performs DMA directly to/from registered GPU buffers. In BB mode, cuFile routes data through a single pre-registered bounce buffer, adding one extra GPU-to-GPU copy but avoiding per-block registration overhead.
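As a toy illustration of the trade-off (hypothetical function, not the real cuFile accounting): direct mode performs one buffer registration per KV cache block, while BB mode performs exactly one regardless of block count:

```python
def registrations_needed(num_blocks: int, backend: str) -> int:
    """Count the GDS buffer registrations each mode would perform.

    Toy model only: real registration costs and table limits are
    internal to cuFile.
    """
    if backend == "GDS":     # direct: register every KV cache block
        return num_blocks
    if backend == "GDS_BB":  # bounce buffer: one pre-registered buffer
        return 1
    return 0                 # POSIX: no GDS registrations at all
```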

Configuration

Add backend to kv_connector_extra_config in your vLLM config:

--kv-transfer-config '{
  "kv_connector": "OffloadingConnector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "spec_name": "SharedStorageOffloadingSpec",
    "spec_module_path": "llmd_fs_backend.spec",
    "shared_storage_path": "/mnt/nvme/kv-cache/",
    "block_size": 256,
    "threads_per_gpu": "64",
    "backend": "GDS"
  }
}'

If you are unsure which backend to use, start with GDS for local NVMe disks or GDS_BB for shared storage.
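For scripted launches it can be less error-prone to build the JSON in code. A sketch using Python's json module, with the same fields as the example above:

```python
import json

# Same fields as the example config above; adjust paths and sizes
# for your deployment.
extra = {
    "spec_name": "SharedStorageOffloadingSpec",
    "spec_module_path": "llmd_fs_backend.spec",
    "shared_storage_path": "/mnt/nvme/kv-cache/",
    "block_size": 256,
    "threads_per_gpu": "64",
    "backend": "GDS",
}
kv_transfer_config = json.dumps({
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": extra,
})
# pass as: vllm serve ... --kv-transfer-config "$kv_transfer_config"
```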

Verify GDS is active

Set STORAGE_LOG_LEVEL=info (the default) and look for this line at startup:

GdsFileIO: GPUDirect Storage (GDS) enabled

If GDS initialization fails, the connector automatically falls back to CPU staging and logs:

StorageOffloadEngine: GDS initialization failed, falling back to CPU_BUFFER_STAGE for both READ and WRITE
StorageOffloadEngine: READ=CPU WRITE=CPU

For more detail during startup, use STORAGE_LOG_LEVEL=debug:

STORAGE_LOG_LEVEL=debug vllm serve ...
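If you are scripting the verification, the summary line can be parsed directly. A sketch that extracts the active READ/WRITE modes from captured startup output (log format taken from the examples above):

```python
def active_modes(log_text: str) -> dict:
    """Find the connector's READ=/WRITE= summary line and return the modes."""
    for line in log_text.splitlines():
        if "READ=" in line and "WRITE=" in line:
            fields = dict(p.split("=", 1) for p in line.split() if "=" in p)
            return {"read": fields["READ"], "write": fields["WRITE"]}
    return {}  # no summary line found
```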

Tuning cuFile (cufile_rdma.json)

cuFile reads its configuration from a JSON file. By default it looks in /etc/cufile.json. To use a custom path:

export CUFILE_ENV_PATH_JSON=/path/to/cufile_rdma.json

An example config is provided at docs/cufile_rdma.json. All settings are documented inline in that file. For RDMA storage, fill in your NIC IPs:

"rdma_dev_addr_list": [ "<RDMA_NIC_IP_1>", "<RDMA_NIC_IP_2>" ]

To find your RDMA NIC IPs:

ibdev2netdev    # maps InfiniBand devices to network interfaces
ip addr show    # note the IPs of those interfaces
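A minimal fragment of what the RDMA section of cufile_rdma.json can look like (a sketch only; key names follow the standard cufile.json layout, and the full, documented set of options lives in docs/cufile_rdma.json):

```json
{
  "properties": {
    "allow_compat_mode": true,
    "use_poll_mode": false,
    "rdma_dev_addr_list": [ "<RDMA_NIC_IP_1>", "<RDMA_NIC_IP_2>" ]
  }
}
```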

Your local cufile_rdma.json (with deployment-specific IPs) should not be committed to version control.

Troubleshooting

cuFileDriverOpen fails

  • Check that the nvidia-fs kernel module is loaded: lsmod | grep nvidia_fs
  • Install the cuFile package: apt-get install -y cuda-cufile-12-9

GDS not supported on this GPU

  • GDS requires cudaDevAttrGPUDirectRDMASupported = 1
  • Check: nvidia-smi --query-gpu=name --format=csv — A100, H100, and most data-center GPUs support it
  • Consumer GPUs (RTX series) typically do not

Filesystem does not support O_DIRECT

  • NFS, FUSE-based mounts, and tmpfs do not support O_DIRECT
  • Use GDS_BB (Bounce Buffer mode) as a fallback for these filesystems
  • For full GDS performance, use a local NVMe or NVMe-oF mount

GDS falls back silently to CPU

  • The connector falls back automatically if GDS init fails
  • Always check startup logs for READ=GDS_DIRECT / READ=GDS_BOUNCE_BUFFER / READ=CPU to confirm the active mode