Add Augmented Core Extraction Algorithm #1404
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

/ok to test 38c03a4
- Adds the out-of-tree ACE method of @anaruse. This assumes graphs smaller than host memory.
- Adds `disk_enabled` and `graph_build_dir` parameters to select the ACE method.
- Uses partitions instead of clusters in ACE to distinguish ACE partitions from regular KNN graph-building clusters.
- Introduced dynamic configuration of nprobes and nlists for IVF-PQ based on partition size to improve KNN graph construction.
- Added logging of both IVF-PQ and NN-Descent parameters to provide better insight during graph building.
- Ensured default parameters are set when no specific graph build parameters are provided.
- Added logic to identify and merge small partitions that do not meet the minimum size requirement for stable KNN graph construction.
- Replaced `disk_enabled` and `graph_build_dir` with `ace_npartitions` and `ace_build_dir` in the parameter parsing logic.
- Updated function signatures and documentation to clarify the new partitioning approach for ACE builds.
- Introduced new functions for reordering and storing datasets on disk, optimized for NVMe performance.
- Clarified naming.
Thanks Julian for the updates, it is great to have the tests! I have a few comments on the new code.
- Fixed an assertion validating the size of data per element in the HNSW serialization function, and moved it out of the main loop.
- Enhanced memory allocation logic in the CAGRA build process, ensuring proper initialization and reducing potential errors.
- Optimized and parallelized some parts.
- Implements cuvsCagraIndexIsOnDisk and cuvsCagraIndexGetFileDirectory.
- Add NumPy headers.
- Introduced methods to update the dataset and graph from disk files using the NumPy header.
- Use file I/O instead of mmap in serialize_to_hnswlib_from_disk.
- Move buffered_ofstream to file_io.hpp.
- Renamed to cagra_hnsw_ace_example to clarify that this uses HNSW for searching.
- Use HNSW for search on the memory path as well.
```cpp
 * @param[in] params cuvsAceParams_t to allocate
 * @return cuvsError_t
 */
cuvsError_t cuvsAceParamsCreate(cuvsAceParams_t* params);
```
@julianmi It looks like you forgot to push/add the C test
- Replaced ASSERT with RAFT_EXPECTS for better error handling in cuvs_cagra_hnswlib_wrapper.h.
- Added a warning log for small dataset sizes in cagra_build.cuh.
- Adjusted the min_recall value in tests to improve test accuracy.
Hi Julian, a few nitpicks on my side
cpp/include/cuvs/neighbors/cagra.hpp (Outdated)
```cpp
template <typename T, typename IdxT>
auto hnsw_to_cagra_params(raft::matrix_extent<int64_t> dataset,
```
Please remove the unused template parameters.
The function will then become a non-template and you'll need to decide in which compilation unit to place it. If #1448 doesn't get merged before this PR, I'd suggest just copying the relevant bits here (creating a separate cagra.cpp unit).
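For concreteness, a sketch of the de-templated declaration; the return type and the trailing parameters are my assumptions, abbreviated from the diff above:

```cpp
// Hypothetical non-template form, e.g. in a new cagra.cpp unit; return
// type and remaining parameters are assumed for illustration only.
auto hnsw_to_cagra_params(raft::matrix_extent<int64_t> dataset /*, ... */)
  -> cuvs::neighbors::cagra::index_params;
```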
```cpp
if (last_slash != std::string::npos) {
  file_directory_ = file_path.substr(0, last_slash);
} else {
  file_directory_ = ".";
```
This would yield the current working directory, right?
I'm wondering, would it make sense to default to something like std::tmpfile for the file path instead? From past experience, it's sometimes a little annoying when cuVS creates random index files in the root of the git directory tree (project folder) that one then has to clean up. An environment where the working directory is the executable's directory and is not writable could also be a problem.
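A minimal sketch of that suggestion, assuming C++17 `<filesystem>` is available (note that `std::tmpfile` returns a `FILE*`, so `std::filesystem::temp_directory_path()` is the closer fit for a directory default):

```cpp
#include <filesystem>
#include <string>

// Fall back to the system temp directory instead of the current working
// directory; illustrative only, not the PR's actual default.
std::string default_index_dir()
{
  return std::filesystem::temp_directory_path().string();
}
```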
Also, do we really need to keep the file_directory_ / file_directory() members? I searched through the code for where they are really needed and couldn't find anything. Do we even need file paths, or could we perhaps get away with cuvs::util::file_descriptor handles?
This is the user-provided build_dir parameter that is part of the ACE build method. It enables users to pick a fast disk to speed up the build; /tmp might not be the fastest disk available and/or might not have the capacity to hold the temporary files and the HNSW index.
file_directory() is used during HNSW index creation: the disk path was added in from_cagra(), and the directory is used in the new serialize_to_hnswlib_from_disk() routine.
cpp/src/util/file_io.hpp (Outdated)
```cpp
file_descriptor(const file_descriptor&) = delete;
file_descriptor& operator=(const file_descriptor&) = delete;

file_descriptor(file_descriptor&& other) noexcept : fd_(other.fd_) { other.fd_ = -1; }
```
A nitpick:

```diff
-file_descriptor(file_descriptor&& other) noexcept : fd_(other.fd_) { other.fd_ = -1; }
+file_descriptor(file_descriptor&& other) noexcept : fd_{std::exchange(other.fd_, -1)} {}
```
Applied.
```cpp
{
  if (this != &other) {
    close();
    fd_ = other.fd_;
    other.fd_ = -1;
  }
  return *this;
}
```
A nitpick:

```diff
-{
-  if (this != &other) {
-    close();
-    fd_ = other.fd_;
-    other.fd_ = -1;
-  }
-  return *this;
-}
+{
+  std::swap(this->fd_, other.fd_);
+  return *this;
+}
```
Note you don't need to manually close the handle here; the destructor will be called on the moved-from object anyway.
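Put together, a self-contained sketch of the swap-based move idiom under discussion, using plain POSIX open/close (illustrative; not the actual cuvs::util::file_descriptor):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <utility>

class file_descriptor {
 public:
  explicit file_descriptor(const char* path, int flags = O_RDONLY)
    : fd_{::open(path, flags)} {}
  ~file_descriptor() { if (fd_ >= 0) ::close(fd_); }

  file_descriptor(const file_descriptor&)            = delete;
  file_descriptor& operator=(const file_descriptor&) = delete;

  // Move construction: steal the handle, leave -1 behind.
  file_descriptor(file_descriptor&& other) noexcept
    : fd_{std::exchange(other.fd_, -1)} {}

  // Move assignment: swapping hands our old fd to `other`, whose
  // destructor will close it; no manual close() needed here.
  file_descriptor& operator=(file_descriptor&& other) noexcept
  {
    std::swap(fd_, other.fd_);
    return *this;
  }

  int get() const noexcept { return fd_; }

 private:
  int fd_ = -1;
};
```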
Applied.
I have finished reviewing the PR. I have created Issue #1486 to keep track of additional issues that I feel are out of scope of this PR.
cpp/src/neighbors/detail/hnsw.hpp (Outdated)
```cpp
os.write(reinterpret_cast<const char*>(graph_row), sizeof(IdxT) * graph_degree_int);

if (odd_graph_degree) {
  assert(odd_graph_degree == appr_algo->maxM0_ - graph_degree_int);
```
This assert is only active in debug mode and fails to compile due to -Werror (comparison of integer expressions of different signedness: 'int' and 'size_t').
Please replace it with RAFT_EXPECTS.
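A sketch of what the replacement could look like, with a cast to align the signedness (the condition is copied from the assert above; the message text is mine):

```cpp
RAFT_EXPECTS(static_cast<size_t>(odd_graph_degree) ==
               appr_algo->maxM0_ - static_cast<size_t>(graph_degree_int),
             "unexpected padding between graph_degree and maxM0_");
```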
Replaced with RAFT_EXPECTS and added a type cast.
cpp/src/neighbors/detail/hnsw.hpp (Outdated)
```cpp
int64_t next_report_offset = d_report_offset;
auto start_clock = std::chrono::system_clock::now();

assert(appr_algo->size_data_per_element_ ==
```
This would only be active in debug mode; let's use RAFT_EXPECTS.
Done.
cpp/src/neighbors/detail/hnsw.hpp (Outdated)
```cpp
size_t bytes_written = 0;
float GiB = 1 << 30;
IdxT zero = 0;
assert(appr_algo->size_data_per_element_ ==
```
use RAFT_EXPECTS
Done.
```diff
   offset(j) = offset(j - 1) + 1;
 }
-IdxT ofst = initial_graph_size * pow(base, (double)j - small_graph_degree - 1);
+IdxT ofst = pow((double)(initial_graph_size - 1) / 2, (double)(j + 1) / small_graph_degree);
```
What is the reason for this change?
@anaruse Could you comment on this please? This change was taken from your proposal.
- We will need to add the final APIs and align tests in the future.
Thanks again for the work, @julianmi! I left a small comment and a question.
```cpp
 * in KNN graph construction. 100k - 5M vectors per partition is recommended
 * depending on the available host and GPU memory.
 */
size_t npartitions;
```
So if we have 1M rows in the dataset and, say, we choose each partition to have 200K vectors (from your range of recommended values), does this mean n_partitions = 1M / 200K = 5? It would be nice if you could add a simple formula to help the user choose npartitions!
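Something like the following ceiling division would do (my suggestion, not from the PR):

```cpp
#include <cstddef>

// npartitions = ceil(n_rows / target_partition_size); e.g. with
// n_rows = 1'000'000 and a target of 200'000 this yields 5 partitions.
std::size_t choose_npartitions(std::size_t n_rows, std::size_t target_partition_size)
{
  return (n_rows + target_partition_size - 1) / target_partition_size;
}
```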
```cpp
if (use_disk_mode) {
  // Load partition dataset from disk files
  ace_load_partition_dataset_from_disk<T, IdxT>(res,
                                                build_dir,
                                                partition_id,
                                                dataset_dim,
                                                partition_histogram.view(),
                                                core_partition_offsets.view(),
                                                augmented_partition_offsets.view(),
                                                sub_dataset.view());
```
One question: if we assume that the dataset fits in memory anyway, why do we have to partition the data, store it to disk, and then read it back per partition? Can't we gather the corresponding vectors from memory on the fly per partition if we have the necessary information (partition_histogram, core_partition_offsets, augmented_partition_offsets)?
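To illustrate the idea, a rough sketch of such an on-the-fly gather (all names and the layout are assumptions for illustration, not the actual cuVS code):

```cpp
#include <algorithm>
#include <cstdint>

// Gather one partition's rows directly from the in-memory dataset using
// the precomputed offsets, instead of a disk round-trip. `backward_list`
// maps reordered row i back to its original dataset row.
void gather_partition(const float* dataset, int64_t dim,
                      const int64_t* backward_list,
                      int64_t partition_begin, int64_t partition_size,
                      float* sub_dataset)
{
  for (int64_t i = 0; i < partition_size; ++i) {
    const int64_t src = backward_list[partition_begin + i];
    std::copy_n(dataset + src * dim, dim, sub_dataset + i * dim);
  }
}
```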
This PR introduces Augmented Core Extraction (ACE), an approach proposed by @anaruse for building CAGRA indices on very large datasets that exceed GPU memory capacity. ACE enables users to build high-quality approximate nearest neighbor search indices on datasets that would otherwise be impossible to process on a single GPU. The approach uses host memory if it is large enough and falls back to disk if required.
This work is a collaboration: @anaruse, @tfeher, @achirkin, @mfoerste4
Algorithm Description
Each partition is processed independently through the regular KNN graph build (the build_knn_graph() flow) with its primary vectors plus augmented vectors from neighboring partitions. In disk mode, ACE stores the dataset mapping (dataset_mapping.bin), the reordered dataset (reordered_dataset.bin), and the optimized CAGRA graph (cagra_graph.bin) on disk. The index is then incomplete, as shown by cuvs::neighbors::index::on_disk(). The files are stored in cuvs::neighbors::index::file_directory(). The HNSW index serialization was provided by @mfoerste4 in [WIP] Add disk2disk serialization for ACE Algorithm #1410, which was merged here. This adds the serialize_to_hnsw() serialization routine, which allows combining the dataset, graph, and mapping. The data is combined on the fly while being streamed from disk to disk, trying to minimize the required host memory; the host still needs enough memory to hold the index, though.

Core Components
- `ace_build()`: Main routine which users should call.
- `ace_get_partition_labels()`: Performs balanced k-means clustering to assign each vector to its two closest partitions while handling small-partition merging.
- `ace_create_forward_and_backward_lists()`: Creates bidirectional ID mappings between original dataset indices and reordered partition-local indices.
- `ace_set_index_params()`: Sets the index parameters based on the partition and augmented dataset to ensure efficient KNN graph building.
- `ace_gather_partition_dataset()`: In-memory only: gathers the partition and augmented dataset.
- `ace_adjust_sub_graph_ids()`: In-memory only: adjusts IDs in the sub search graph and stores them into the main search graph.
- `ace_adjust_final_graph_ids()`: In-memory only: maps graph neighbor IDs from reordered space back to original vector IDs.
- `ace_reorder_and_store_dataset()`: Disk only: reorders the dataset based on partitions and stores it to disk. Uses write buffers to improve performance.
- `ace_load_partition_dataset_from_disk()`: Disk only: loads the partition and augmented datasets from disk.
- `file_descriptor` and `ace_read_large_file()`/`ace_write_large_file()`: RAII file handle and chunked file I/O operations.
- Adds the `on_disk_` flag and `file_directory_` to the CAGRA index structure to support disk-backed indices.
- Adds `ace_npartitions` and `ace_build_dir` to the CAGRA parameters so users can specify that ACE should be used and which directory to use if required.

Usage
C++ API
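A minimal sketch of how the C++ API might be used, assuming the parameter names introduced in this PR (`ace_npartitions`, `ace_build_dir`); the exact field placement may differ from the final code:

```cpp
#include <cuvs/neighbors/cagra.hpp>
#include <raft/core/device_mdspan.hpp>
#include <raft/core/resources.hpp>

void build_with_ace(raft::resources const& res,
                    raft::device_matrix_view<const float, int64_t> dataset)
{
  cuvs::neighbors::cagra::index_params params;
  params.graph_degree              = 64;
  params.intermediate_graph_degree = 128;
  params.ace_npartitions = 10;           // hypothetical field from this PR: selects ACE
  params.ace_build_dir   = "/mnt/nvme";  // hypothetical field: fast disk for temp files

  auto index = cuvs::neighbors::cagra::build(res, params, dataset);

  // In disk mode the returned index may be incomplete in memory; per the
  // PR description, index.on_disk() reports this and index.file_directory()
  // points at the on-disk artifacts.
}
```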
Storage Requirements
- `cagra_graph.bin`: `n_rows * graph_degree * sizeof(IdxT)`
- `dataset_mapping.bin`: `n_rows * sizeof(IdxT)`
- `reordered_dataset.bin`: size of the input dataset
- `augmented_dataset.bin`: size of the input dataset
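A worked example of these footprints (my own arithmetic, not from the PR), assuming 100M vectors of 96 float32 dimensions, graph_degree 64, and IdxT = uint32_t:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main()
{
  const std::size_t n_rows = 100'000'000, dim = 96, graph_degree = 64;
  const double GiB = static_cast<double>(1ull << 30);

  const double graph   = n_rows * graph_degree * sizeof(uint32_t) / GiB;  // cagra_graph.bin
  const double mapping = n_rows * sizeof(uint32_t) / GiB;                 // dataset_mapping.bin
  const double data    = n_rows * dim * sizeof(float) / GiB;              // reordered/augmented, each

  std::printf("graph %.1f GiB, mapping %.1f GiB, dataset %.1f GiB each\n",
              graph, mapping, data);  // ~23.8, ~0.4, ~35.8
}
```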