[Feature Request] Saving GPU-Translated States for Fast CPU-to-GPU Transfers
Hello FAISS community and developers, we were curious whether there would be interest in implementing a version of our faster CPU-GPU translation modifications to FAISS.
Motivation
New architectures like the NVIDIA GH200 tightly couple the CPU and GPU with a fast NVLink interconnect (900 GB/s). On traditional PCIe-based systems it does not make sense to improve the upload path (the current approach moves an index to the GPU once and reuses it from GPU memory thereafter), but the GH200 opens some unique optimization opportunities. We have developed a hardware-software co-design, based on our work published at ISCA '25, that hierarchically searches subsets of massive indices.
This approach allows a single GPU to effectively serve an index much larger than its own memory capacity by utilizing CPU memory and NVLink to host and serve portions of FAISS IVF indices. On coupled CPU-GPU systems, we have demonstrated and tested a latency improvement of up to 10x for large-scale index serving over CPU-only implementations.
Hardware-Software Co-design
- Baseline Construction: A monolithic index is constructed using k-means clustering, after which all IVF lists (`nLists`) and centroids are extracted and stored (e.g., in a NumPy array) for efficient lookup.
- Sizing & Grouping: The distribution of `nList` sizes is analyzed to determine a safe upper bound for GPU memory. Based on centroid proximity, `nLists` are grouped and used to build smaller, targeted sub-indices.
- Mapping: A lookup table maps each `nList` ID to its corresponding sub-index.
- Runtime Execution:
  - Sub-indices are warmed using a GPU preparation pipeline to eliminate setup overhead.
  - Centroid data and the lookup table stay resident in GPU memory.
  - Incoming queries identify the most relevant `nLists`, and only those specific sub-indices are dynamically loaded onto the GPU via NVLink.
  - GPU search retrieves top-k neighbors and returns associated document IDs.
By leveraging the high bandwidth of NVLink and the parallel processing power of the GPU, we serve these indices at 10x the speed of CPU-only implementations.
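The routing step above can be sketched in a few lines. This is an illustrative numpy-only mock, not the actual implementation or FAISS API: the names `centroids`, `list_to_subindex`, and `route_query`, and the round-robin grouping, are all hypothetical stand-ins for the resident centroid data, the lookup table, and the query-time selection logic.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8          # vector dimensionality
n_lists = 32   # number of IVF lists in the monolithic index
n_sub = 4      # number of sub-indices the lists were grouped into

# Centroids and the lookup table stay resident (on the GPU in the real system).
centroids = rng.standard_normal((n_lists, d)).astype(np.float32)
# Lookup table: nList ID -> sub-index ID (here a toy round-robin grouping;
# the real grouping is by centroid proximity).
list_to_subindex = np.arange(n_lists) % n_sub

def route_query(query, nprobe=4):
    """Find the nprobe nearest IVF lists, then the sub-indices to load."""
    dists = np.linalg.norm(centroids - query, axis=1)
    nearest_lists = np.argsort(dists)[:nprobe]
    needed_subindices = np.unique(list_to_subindex[nearest_lists])
    return nearest_lists, needed_subindices

q = rng.standard_normal(d).astype(np.float32)
lists, subs = route_query(q, nprobe=4)
print("probe lists:", lists, "-> load sub-indices:", subs)
```

Only the sub-indices in `subs` are then pulled over NVLink, so the per-query transfer volume tracks `nprobe` rather than the full index size.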
Feature Accommodations
To support these high-speed CPU-GPU movement paths, we are proposing some optimizations to the FAISS library to natively handle GPU-specific index representations through a two-stage warmup process:
1. Format Caching (One-time Operation)
The system translates the index into specialized binary files that reflect the GPU’s native memory layout:
- `gpu_codes_all.bin`
- `gpu_indices_all.bin`
- `gpu_codes_all.meta`
By pre-calculating vector formats and structural offsets, we transform the index into a single contiguous memory array. This replaces the many per-list pointer updates performed by the current translation function with a single, batched `cudaMemcpyAsync`.
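The packing idea can be illustrated with a minimal numpy sketch, under assumed names (`inv_lists`, `offsets`, `packed`) that are not the real file format: variable-length per-list codes are flattened into one contiguous buffer (the role of `gpu_codes_all.bin`), with an offset table (the role of the `.meta` sidecar) recording where each list starts.

```python
import numpy as np

rng = np.random.default_rng(1)
code_size = 16  # bytes per encoded vector (hypothetical; SQ8 stores ~d bytes)

# Variable-length inverted lists of uint8 codes.
inv_lists = [rng.integers(0, 256, size=(n, code_size), dtype=np.uint8)
             for n in (5, 0, 3, 7)]

# "Meta": byte offsets of each list inside the packed buffer.
sizes = np.array([l.size for l in inv_lists], dtype=np.int64)
offsets = np.concatenate(([0], np.cumsum(sizes)))

# The contiguous cache: uploadable with one bulk copy instead of
# one pointer fix-up per inverted list.
packed = np.concatenate([l.reshape(-1) for l in inv_lists])

# Any list is still recoverable from (offsets, code_size).
i = 2
recovered = packed[offsets[i]:offsets[i + 1]].reshape(-1, code_size)
```

Because `packed` is one contiguous region, the device-side copy degenerates to a single transfer whose source and destination addresses are known up front.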
2. Memory Pipeline Optimization
We stage these pre-formatted caches in a global buffer to enable direct DMA transfers. This bypasses the original index’s inverted lists to avoid redundant loading. By pre-allocating GPU buffers and utilizing multiple CUDA streams for parallelized uploads, we maximize the utilization of the 900 GB/s NVLink interconnect.
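A rough sketch of the multi-stream split, with the CUDA side simulated: the pre-formatted buffer is cut into contiguous chunks, one per stream, each of which would be issued as its own `cudaMemcpyAsync` into a pre-allocated device buffer. The chunking arithmetic below is an assumption for illustration, not the fork's actual scheduling policy.

```python
import numpy as np

rng = np.random.default_rng(2)
n_streams = 4
packed = rng.integers(0, 256, size=1000, dtype=np.uint8)  # stand-in cache

# Split the buffer into roughly equal contiguous chunks, one per stream.
bounds = np.linspace(0, packed.size, n_streams + 1, dtype=np.int64)
chunks = [packed[bounds[i]:bounds[i + 1]] for i in range(n_streams)]

# Simulated device buffer: pre-allocated once, reused across uploads.
device = np.empty_like(packed)
for i, chunk in enumerate(chunks):
    # In the real pipeline: cudaMemcpyAsync(dst + bounds[i], chunk, ...) on
    # stream i, so the n_streams copies proceed in parallel over NVLink.
    device[bounds[i]:bounds[i + 1]] = chunk
```

Pre-allocating `device` keeps allocation off the query path; the per-stream chunks are what let the copies overlap and saturate the interconnect.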
Implementation Details
We have implemented and tested an example of these changes on a GH200 for IVF SQ8 type indices in our fork: