[Feature Request] Saving GPU-Translated States for Fast CPU-to-GPU Transfers
Hello FAISS community and developers, we were curious whether there would be interest in implementing a version of our faster CPU-GPU translation modifications to FAISS.
Motivation
New architectures like the NVIDIA GH200 tightly couple the CPU and GPU with a fast NVLink interconnect (900 GB/s). On traditional PCIe-based systems it does not make sense to improve the upload path (the current approach moves an index to the GPU once and reuses it from GPU memory thereafter), but the GH200 opens some unique optimization opportunities. We have developed a hardware-software co-design, based on our work published at ISCA '25, that hierarchically searches subsets of massive indices.
This approach allows a single GPU to effectively serve an index much larger than its own memory capacity by utilizing CPU memory and NVLink to host and serve portions of FAISS IVF indices. On coupled CPU-GPU systems, we have demonstrated and tested a latency improvement of up to 10x for large-scale index serving over CPU-only implementations.
Hardware-Software Co-design
- Baseline Construction: A monolithic index is constructed using k-means clustering, after which all IVF lists (`nLists`) and centroids are extracted and stored (e.g., in a NumPy array) for efficient lookup.
- Sizing & Grouping: The distribution of `nList` sizes is analyzed to determine a safe upper bound for GPU memory. Based on centroid proximity, `nLists` are grouped and used to build smaller, targeted sub-indices.
- Mapping: A lookup table maps each `nList` ID to its corresponding sub-index.
- Runtime Execution:
  - Sub-indices are warmed using a GPU preparation pipeline to eliminate setup overhead.
  - Centroid data and the lookup table stay resident in GPU memory.
  - Incoming queries identify the most relevant `nLists`, and only those specific sub-indices are dynamically loaded onto the GPU via NVLink.
  - GPU search retrieves top-k neighbors and returns associated document IDs.
By leveraging the high bandwidth of NVLink and the parallel processing power of the GPU, we serve these indices at 10x the speed of CPU-only implementations.
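The routing step above can be sketched in a few lines. This is an illustrative numpy-only mock, not the actual implementation or FAISS API: the names `centroids`, `list_to_subindex`, and `route_query`, and the round-robin grouping, are all hypothetical stand-ins for the resident centroid data, the lookup table, and the query-time selection logic.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8          # vector dimensionality
n_lists = 32   # number of IVF lists in the monolithic index
n_sub = 4      # number of sub-indices the lists were grouped into

# Centroids and the lookup table stay resident (on the GPU in the real system).
centroids = rng.standard_normal((n_lists, d)).astype(np.float32)
# Lookup table: nList ID -> sub-index ID (here a toy round-robin grouping;
# the real grouping is by centroid proximity).
list_to_subindex = np.arange(n_lists) % n_sub

def route_query(query, nprobe=4):
    """Find the nprobe nearest IVF lists, then the sub-indices to load."""
    dists = np.linalg.norm(centroids - query, axis=1)
    nearest_lists = np.argsort(dists)[:nprobe]
    needed_subindices = np.unique(list_to_subindex[nearest_lists])
    return nearest_lists, needed_subindices

q = rng.standard_normal(d).astype(np.float32)
lists, subs = route_query(q, nprobe=4)
print("probe lists:", lists, "-> load sub-indices:", subs)
```

Only the sub-indices in `subs` are then pulled over NVLink, so the per-query transfer volume tracks `nprobe` rather than the full index size.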
Feature Accommodations
To support these high-speed CPU-GPU movement paths, we are proposing some optimizations to the FAISS library to natively handle GPU-specific index representations through a two-stage warmup process:
1. Format Caching (One-time Operation)
The system translates the index into specialized binary files that reflect the GPU’s native memory layout:
- `gpu_codes_all.bin`
- `gpu_indices_all.bin`
- `gpu_codes_all.meta`
By pre-calculating vector formats and structural offsets, we transform the index into a single contiguous memory array. This replaces the many per-list pointer updates performed by the current translation function with a single, batched `cudaMemcpyAsync`.
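The packing idea can be illustrated with a minimal numpy sketch, under assumed names (`inv_lists`, `offsets`, `packed`) that are not the real file format: variable-length per-list codes are flattened into one contiguous buffer (the role of `gpu_codes_all.bin`), with an offset table (the role of the `.meta` sidecar) recording where each list starts.

```python
import numpy as np

rng = np.random.default_rng(1)
code_size = 16  # bytes per encoded vector (hypothetical; SQ8 stores ~d bytes)

# Variable-length inverted lists of uint8 codes.
inv_lists = [rng.integers(0, 256, size=(n, code_size), dtype=np.uint8)
             for n in (5, 0, 3, 7)]

# "Meta": byte offsets of each list inside the packed buffer.
sizes = np.array([l.size for l in inv_lists], dtype=np.int64)
offsets = np.concatenate(([0], np.cumsum(sizes)))

# The contiguous cache: uploadable with one bulk copy instead of
# one pointer fix-up per inverted list.
packed = np.concatenate([l.reshape(-1) for l in inv_lists])

# Any list is still recoverable from (offsets, code_size).
i = 2
recovered = packed[offsets[i]:offsets[i + 1]].reshape(-1, code_size)
```

Because `packed` is one contiguous region, the device-side copy degenerates to a single transfer whose source and destination addresses are known up front.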
2. Memory Pipeline Optimization
We stage these pre-formatted caches in a global buffer to enable direct DMA transfers. This bypasses the original index’s inverted lists to avoid redundant loading. By pre-allocating GPU buffers and utilizing multiple CUDA streams for parallelized uploads, we maximize the utilization of the 900 GB/s NVLink interconnect.
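A rough sketch of the multi-stream split, with the CUDA side simulated: the pre-formatted buffer is cut into contiguous chunks, one per stream, each of which would be issued as its own `cudaMemcpyAsync` into a pre-allocated device buffer. The chunking arithmetic below is an assumption for illustration, not the fork's actual scheduling policy.

```python
import numpy as np

rng = np.random.default_rng(2)
n_streams = 4
packed = rng.integers(0, 256, size=1000, dtype=np.uint8)  # stand-in cache

# Split the buffer into roughly equal contiguous chunks, one per stream.
bounds = np.linspace(0, packed.size, n_streams + 1, dtype=np.int64)
chunks = [packed[bounds[i]:bounds[i + 1]] for i in range(n_streams)]

# Simulated device buffer: pre-allocated once, reused across uploads.
device = np.empty_like(packed)
for i, chunk in enumerate(chunks):
    # In the real pipeline: cudaMemcpyAsync(dst + bounds[i], chunk, ...) on
    # stream i, so the n_streams copies proceed in parallel over NVLink.
    device[bounds[i]:bounds[i + 1]] = chunk
```

Pre-allocating `device` keeps allocation off the query path; the per-stream chunks are what let the copies overlap and saturate the interconnect.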
Implementation Details
We have implemented and tested an example of these changes on a GH200 for IVF SQ8 type indices in our fork: