[FEA] CAGRA -> multi layer hnswlib (reduce memory and intermediate files needed) with NO dataset in the final output #1482

@pmiloslavsky

Is your feature request related to a problem? Please describe.

Request for a C++ API that produces a multilayer (hierarchy CPU) ASCII graph text file (hnswlib multi-layer format, so not a CAGRA single layer) from a CAGRA index, with the least possible use of intermediate memory and intermediate files, and without including the dataset in the final output.
A binary final output is OK, but ASCII text is strongly preferred.

The problem is to pass embeddings to cagra, and get back graph information that's compatible with hnswlib multi layer. The actual data will remain in the database from where it was passed to cagra. The native cagra search would not be used. Once we have the graph information, we would store it back in the database (the neighbor index integer is a known association with data in the database) and search it using our own methods.

Any way you can get this done is great. An iterator approach is fine, but I don't see how an iterator would work. Ignore the text file details if they bother you.

Describe the solution you'd like

Final text file option:
This has to have code similar to from_cagra():

std::enable_if_t<hierarchy == HnswHierarchy::CPU, std::unique_ptr<index<T>>> from_cagra(

With the addition of:

#include <fstream>
#include <iostream>
#include <string>
#include <hnswlib/hnswlib.h>

// Writes the graph structure only (no vectors): a header line, then one line
// per (node, level) of the form "<label> <level> <neighbor labels...>".
void make_no_dataset_txt_file(const std::string& output_path, hnswlib::HierarchicalNSW<float>* index) {
    std::ofstream out(output_path);
    if (!out.is_open()) {
        std::cerr << "Failed to open " << output_path << " for writing" << std::endl;
        return;
    }

    // Header: build parameters, entry point (as external label), top layer, capacity.
    out << "ef " << index->ef_construction_ << " M " << index->M_
        << " EntryPoint " << index->getExternalLabel(index->enterpoint_node_)
        << " MaxLayer " << index->maxlevel_
        << " MaxElements " << index->max_elements_ << "\n";

    const hnswlib::tableint num_nodes = static_cast<hnswlib::tableint>(index->getCurrentElementCount());
    for (hnswlib::tableint i = 0; i < num_nodes; i++) {
        // A node appears on every layer from 0 up to its own maximum level.
        int node_max_level = index->element_levels_[i];
        for (int level = 0; level <= node_max_level; level++) {
            // The link list stores its size first, followed by the neighbor ids.
            hnswlib::linklistsizeint *ll_cur = index->get_linklist_at_level(i, level);
            int size = index->getListCount(ll_cur);
            hnswlib::tableint *neighbors = reinterpret_cast<hnswlib::tableint *>(ll_cur + 1);

            // Internal ids are translated to external labels before writing.
            out << index->getExternalLabel(i) << " " << level;
            for (int j = 0; j < size; j++) {
                out << " " << index->getExternalLabel(neighbors[j]);
            }
            out << "\n";
        }
    }
    out.close();
}

at the end.

The API could be called from_cagra_serialize_to_dataset_free_ascii_text(textfilename)
It could run on the GPU or the CPU.

The final product should look like this:

ubuntu@ip-172-31-30-158:/opt/iris/NGPU/mgr/user$ more cagra-hnswlib-serialization-multi-layer-M16-EFC32-nrows2000000-ndim768-fp32.txt
ef 32 M 16 EntryPoint 1109932 MaxLayer 6 MaxElements 2000000 
<graph neighbors> 
1250000 0 50000 866726 1371004 207068 879421 1599652 1555841 938599 1083443 927946 1004281 498189 983055 1406649 1928619 139247 351203 1176062 983028 1551203 1837023 637023 206649 1716080 516080 833543 1836302 636302 1454374 254374 938632 1978668
</graph neighbors> 
... 
2000000 total lines 
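
To make the expected file layout concrete, here is a minimal sketch of how a consumer might parse it: one header line, then one line per (label, level) followed by that node's neighbor labels. The names `GraphHeader`, `GraphRecord`, `parse_header`, and `parse_record` are hypothetical and are not part of cuVS or hnswlib; the sketch only assumes the whitespace-separated layout shown in the sample above.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Header fields, in the order the sample file prints them (hypothetical type).
struct GraphHeader {
  std::size_t ef = 0, M = 0, entry_point = 0, max_layer = 0, max_elements = 0;
};

// One record: a node label, a layer, and the node's neighbor labels on that layer.
struct GraphRecord {
  std::uint64_t label = 0;
  int level = 0;
  std::vector<std::uint64_t> neighbors;
};

// Parses "ef <n> M <n> EntryPoint <n> MaxLayer <n> MaxElements <n>".
// Returns false if a keyword is missing or a value is not numeric.
inline bool parse_header(const std::string& line, GraphHeader& h) {
  std::istringstream in(line);
  std::string k1, k2, k3, k4, k5;
  in >> k1 >> h.ef >> k2 >> h.M >> k3 >> h.entry_point >> k4 >> h.max_layer >> k5 >>
      h.max_elements;
  return !in.fail() && k1 == "ef" && k2 == "M" && k3 == "EntryPoint" &&
         k4 == "MaxLayer" && k5 == "MaxElements";
}

// Parses "<label> <level> <neighbor> <neighbor> ...".
// Returns false on a malformed line, including trailing non-numeric text.
inline bool parse_record(const std::string& line, GraphRecord& r) {
  std::istringstream in(line);
  if (!(in >> r.label >> r.level)) return false;
  r.neighbors.clear();
  std::uint64_t n = 0;
  while (in >> n) r.neighbors.push_back(n);
  return in.eof();  // the loop must have stopped at end of line, not at bad input
}
```

Because each record is self-describing (label and level come first), a reader can stream the file line by line and write each neighbor list to the database without holding the whole graph in memory.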

C++ Iterator option:
Please also include a discussion of whether some kind of iterator over neighbors is possible, as that was our first thought.
One call of the iterator (if it's possible) could return something like the part between the <iterator return> markers below:

ubuntu@ip-172-31-30-158:/opt/iris/NGPU/mgr/user$ more cagra-hnswlib-serialization-multi-layer-M16-EFC32-nrows2000000-ndim768-fp32.txt
ef 32 M 16 EntryPoint 1109932 MaxLayer 6 MaxElements 2000000

<iterator return> 
1250000 0 50000 866726 1371004 207068 879421 1599652 1555841 938599 1083443 927946 1004281 498189 983055 1406649 1928619 139247 351203 1176062 983028 1551203 1837023 637023 206649 1716080 516080 833543 1836302 636302 1454374 254374 938632 1978668
</iterator return> 
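
As a starting point for that discussion, here is a minimal sketch of what such an iterator could look like on the caller's side. It walks a hypothetical in-memory layered graph (`LayeredGraph`, where `graph[node][level]` is that node's neighbor list and the node index doubles as the label) and yields one (label, level, neighbors) record per call. `LayeredGraph`, `NeighborRecord`, `NeighborIterator`, and `format_record` are illustrative names only, not cuVS or hnswlib types, and the sketch assumes the layered graph already exists in host memory; whether the layers could be produced lazily to save memory is exactly the open question.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a multi-layer graph: graph[node][level] holds the
// neighbor labels of `node` on `level`. Not a cuVS or hnswlib type.
using LayeredGraph = std::vector<std::vector<std::vector<std::uint32_t>>>;

// What one call of the iterator hands back. `neighbors` points into the graph.
struct NeighborRecord {
  std::uint32_t label;
  int level;
  const std::vector<std::uint32_t>* neighbors;
};

// Visits every (node, level) pair in node order, lowest level first.
class NeighborIterator {
 public:
  explicit NeighborIterator(const LayeredGraph& g) : g_(g) { skip_empty_nodes(); }

  bool done() const { return node_ >= g_.size(); }

  // Precondition: !done().
  NeighborRecord current() const {
    return {static_cast<std::uint32_t>(node_), static_cast<int>(level_),
            &g_[node_][level_]};
  }

  // Advances to the next level of this node, or to the next node with levels.
  void next() {
    if (++level_ >= g_[node_].size()) {
      ++node_;
      level_ = 0;
      skip_empty_nodes();
    }
  }

 private:
  void skip_empty_nodes() {
    while (node_ < g_.size() && g_[node_].empty()) ++node_;
  }

  const LayeredGraph& g_;
  std::size_t node_ = 0;
  std::size_t level_ = 0;
};

// Formats a record as one line of the text file: "<label> <level> <neighbors...>".
inline std::string format_record(const NeighborRecord& r) {
  std::ostringstream out;
  out << r.label << ' ' << r.level;
  for (std::uint32_t n : *r.neighbors) out << ' ' << n;
  return out.str();
}
```

With an interface of this shape, the caller decides what to do with each record (write a text line, insert a database row), so no dataset and no intermediate file is needed on the consumer side.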

Describe alternatives you've considered
Existing APIs can do this, but they create large intermediate files and use a lot of CPU memory (though much less actual CPU time than an hnswlib solution).
