Description
Is your feature request related to a problem? Please describe.
Request for a C++ API that produces a multilayer (hierarchy CPU) ASCII graph text file (hnswlib multi-layer format, so not a cagra single-layer graph) from a cagra index, with the least possible use of intermediate memory and intermediate files, and without including the dataset in the final output.
A binary file as final output is OK, but we strongly prefer ASCII text.
The problem is to pass embeddings to cagra, and get back graph information that's compatible with hnswlib multi layer. The actual data will remain in the database from where it was passed to cagra. The native cagra search would not be used. Once we have the graph information, we would store it back in the database (the neighbor index integer is a known association with data in the database) and search it using our own methods.
Any way you can get this done is great. An iterator approach is fine, but I don't see how an iterator would work. Ignore the text file details if they bother you.
Describe the solution you'd like
Final text file option:
This has to have code similar to from_cagra():
cuvs/cpp/src/neighbors/detail/hnsw.hpp, line 155 in 815d86d:
std::enable_if_t<hierarchy == HnswHierarchy::CPU, std::unique_ptr<index<T>>> from_cagra(
With the addition of:
void make_no_dataset_txt_file(const std::string& output_path, hnswlib::HierarchicalNSW<float>* index) { 
    std::ofstream out(output_path); 
    if (!out.is_open()) { 
        std::cerr << "Failed to open " << output_path << " for writing" << std::endl; 
        return; 
    } 
 
    out << "ef " << index->ef_construction_ << " M " << index->M_
        << " EntryPoint " << index->getExternalLabel(index->enterpoint_node_)
        << " MaxLayer " << index->maxlevel_
        << " MaxElements " << index->max_elements_ << "\n"; 
    const hnswlib::tableint num_nodes = static_cast<hnswlib::tableint>(index->getCurrentElementCount()); 
    for (hnswlib::tableint i = 0; i < num_nodes; i++) { 
        int node_max_level = index->element_levels_[i]; 
        for (int level = 0; level <= node_max_level; level++) { 
            hnswlib::linklistsizeint *ll_cur = index->get_linklist_at_level(i, level); 
            int size = index->getListCount(ll_cur); 
            hnswlib::tableint *neighbors = reinterpret_cast<hnswlib::tableint *>(ll_cur + 1); 
 
            out << index->getExternalLabel(i) << " " << level; 
            for (int j = 0; j < size; j++) { 
                out << " " << index->getExternalLabel(neighbors[j]); 
            } 
            out << "\n"; 
        } 
    } 
    out.close(); 
    return; 
} 
at the end.
The API could be called from_cagra_serialize_to_dataset_free_ascii_text(textfilename)
It could run on the GPU or the CPU.
The final product should look like this:
ubuntu@ip-172-31-30-158:/opt/iris/NGPU/mgr/user$ more cagra-hnswlib-serialization-multi-layer-M16-EFC32-nrows2000000-ndim768-fp32.txt 
ef 32 M 16 EntryPoint 1109932 MaxLayer 6 MaxElements 2000000 
<graph neighbors> 
1250000 0 50000 866726 1371004 207068 879421 1599652 1555841 938599 1083443 927946 1004281 498189 983055 1406649 1928619 139247 351203 1176062 983028 1551203 1837023 637023 206649 1716080 516080 833543 1836302 636302 1454374 254374 938632 1978668 
</graph neighbors> 
... 
2000000 total lines 
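For the consumer side, the format above can be read back one line at a time, so the whole graph never has to sit in memory before it goes into the database. The following is a minimal sketch, assuming the layout shown above (one header line, then one `<label> <level> <neighbors...>` line per node and level, each on a single physical line). The names `GraphHeader`, `LinkList`, `parse_header`, `parse_link_list`, `read_graph` and `count_lists_per_level` are hypothetical and not part of cuVS or hnswlib.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical holder for the values on the first line of the file.
struct GraphHeader {
    std::size_t ef = 0;
    std::size_t M = 0;
    std::uint64_t entry_point = 0;
    int max_layer = 0;
    std::size_t max_elements = 0;
};

// Hypothetical holder for one "<label> <level> <neighbors...>" line.
struct LinkList {
    std::uint64_t label = 0;               // external label of the node
    int level = 0;                         // layer this list belongs to
    std::vector<std::uint64_t> neighbors;  // external labels of the neighbors
};

// Parses "ef <n> M <n> EntryPoint <n> MaxLayer <n> MaxElements <n>".
inline bool parse_header(const std::string& line, GraphHeader& h) {
    std::istringstream in(line);
    std::string k1, k2, k3, k4, k5;
    in >> k1 >> h.ef >> k2 >> h.M >> k3 >> h.entry_point >> k4 >> h.max_layer >> k5 >> h.max_elements;
    return !in.fail() && k1 == "ef" && k2 == "M" && k3 == "EntryPoint" &&
           k4 == "MaxLayer" && k5 == "MaxElements";
}

// Parses one graph line; rejects negative levels and non-numeric tokens.
inline bool parse_link_list(const std::string& line, LinkList& out) {
    std::istringstream in(line);
    if (!(in >> out.label >> out.level) || out.level < 0) return false;
    out.neighbors.clear();
    std::uint64_t n = 0;
    while (in >> n) out.neighbors.push_back(n);
    return in.eof();  // false if a bad token stopped the loop early
}

// Streams the file: fills `header`, then calls `visit` once per graph line.
template <typename Visitor>
bool read_graph(std::istream& in, GraphHeader& header, Visitor&& visit) {
    std::string line;
    if (!std::getline(in, line) || !parse_header(line, header)) return false;
    LinkList ll;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        if (!parse_link_list(line, ll)) return false;
        visit(ll);
    }
    return true;
}

// Example consumer: counts how many link lists exist at each level.
inline bool count_lists_per_level(std::istream& in, std::vector<std::size_t>& counts) {
    GraphHeader header;
    counts.clear();
    return read_graph(in, header, [&counts](const LinkList& ll) {
        assert(ll.level >= 0);  // guaranteed by parse_link_list
        const auto level = static_cast<std::size_t>(ll.level);
        if (counts.size() <= level) counts.resize(level + 1, 0);
        ++counts[level];
    });
}
```

In our case the visitor passed to `read_graph` would write each `LinkList` to the database instead of counting it.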
C++ Iterator option:
Please also include a discussion of whether some kind of iterator over neighbors is possible, as that was our first thought.
One call of the iterator (if it is possible) could return something like the part between the <iterator return> markers below:
ubuntu@ip-172-31-30-158:/opt/iris/NGPU/mgr/user$ more cagra-hnswlib-serialization-multi-layer-M16-EFC32-nrows2000000-ndim768-fp32.txt
ef 32 M 16 EntryPoint 1109932 MaxLayer 6 MaxElements 2000000
<iterator return> 
1250000 0 50000 866726 1371004 207068 879421 1599652 1555841 938599 1083443 927946 1004281 498189 983055 1406649 1928619 139247 351203 1176062 983028 1551203 1837023 637023 206649 1716080 516080 833543 1836302 636302 1454374 254374 938632 1978668 
</iterator return> 
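To make the iterator idea concrete, here is a minimal sketch of a cursor that yields one (node, level, neighbors) link list per step, in the same order as the nested loops in `make_no_dataset_txt_file`. It runs over a hypothetical in-memory `LayeredGraph` that only stands in for a converted index; `LayeredGraph`, `LinkListView`, `LinkListCursor` and `format_line` are illustrative names, not cuVS or hnswlib API. Whether cuVS could expose something like this without first building the full hnswlib index on the CPU is exactly the open question. The sketch prints internal ids; real code would presumably map them through `getExternalLabel`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a multi-layer graph:
// links[node][level] holds the neighbor ids of `node` at `level`.
struct LayeredGraph {
    std::vector<std::vector<std::vector<std::uint32_t>>> links;
};

// Non-owning view of one link list; valid while the graph is alive and unmodified.
struct LinkListView {
    std::uint32_t node;
    int level;
    const std::uint32_t* neighbors;
    std::size_t size;
};

// Forward cursor: node by node, and level 0 upward within each node.
class LinkListCursor {
public:
    explicit LinkListCursor(const LayeredGraph& g) : g_(&g) { skip_empty_nodes(); }

    bool done() const { return node_ >= g_->links.size(); }

    LinkListView get() const {
        assert(!done());
        const auto& v = g_->links[node_][level_];
        return {static_cast<std::uint32_t>(node_), static_cast<int>(level_), v.data(), v.size()};
    }

    void next() {
        assert(!done());
        ++level_;
        if (level_ >= g_->links[node_].size()) {  // past this node's top level
            ++node_;
            level_ = 0;
            skip_empty_nodes();
        }
    }

private:
    // Nodes with no levels at all produce no output lines.
    void skip_empty_nodes() {
        while (node_ < g_->links.size() && g_->links[node_].empty()) ++node_;
    }

    const LayeredGraph* g_;
    std::size_t node_ = 0;
    std::size_t level_ = 0;
};

// Formats one view as "<node> <level> <neighbors...>" without a trailing newline.
inline std::string format_line(const LinkListView& v) {
    std::ostringstream out;
    out << v.node << ' ' << v.level;
    for (std::size_t j = 0; j < v.size; ++j) out << ' ' << v.neighbors[j];
    return out.str();
}
```

A caller would loop with `for (LinkListCursor c(g); !c.done(); c.next())` and send each `c.get()` straight to the database, so no text file is needed at all.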
Describe alternatives you've considered
Existing APIs can do this, but they create large intermediate files and use a lot of CPU memory (though much less actual CPU time than an hnswlib solution).