This repository contains implementations and benchmarks for my thesis on Approximate Nearest Neighbor Search (ANNS). ANN graphs are used to efficiently find approximate nearest neighbors in high-dimensional spaces, making them useful for applications like recommendation systems, image retrieval, clustering, and NLP.
Description: HNSW (Hierarchical Navigable Small World) is the foundational ANN graph algorithm. It uses a multi-layer hierarchical structure with small-world navigation.
Key Features:
- Multi-layer construction with probability-based level assignment
- Greedy search on upper layers, beam search on bottom layer
- Parameters: M (connections per node), efConstruction, efSearch

Files:
- hnsw_construction.py - Pure Python implementation
- hnsw_cpp/ - C++ optimized implementation
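
The probability-based level assignment listed under Key Features can be sketched as follows. This is a minimal illustration of the rule commonly used for HNSW, where a node's top layer is drawn from an exponentially decaying distribution with normalization factor 1 / ln(M). It is not taken from hnsw_construction.py; the function name `assign_level` and the seeded generator are assumptions made for the example.

```python
import math
import random


def assign_level(M, rng=random):
    """Sample the top layer for a new node (hypothetical helper).

    Uses the usual HNSW rule floor(-ln(u) * mL) with mL = 1 / ln(M),
    so roughly a fraction 1/M of nodes reach layer 1 or higher.
    """
    m_l = 1.0 / math.log(M)      # level normalization factor
    u = 1.0 - rng.random()       # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)


# Example: with M = 16, most nodes should stay on the bottom layer.
rng = random.Random(0)
levels = [assign_level(16, rng) for _ in range(10000)]
bottom_fraction = levels.count(0) / len(levels)
```

With M = 16 the expected share of nodes that stay on layer 0 is 1 - 1/16, which is why the upper layers remain sparse enough for greedy search.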
Description: DARTH is an early termination strategy that uses search-history features to predict when the search can safely stop.
Key Features:
- Monitors search history features (distance statistics, iteration count)
- Uses a predictor to estimate termination confidence
- Parameters: Rt (re-termination threshold), ipi (initial prediction interval), mpi (minimum prediction interval)

Files:
- hsnw_constructionDARTH.py - Python implementation
- hnswDarth_cpp/ - C++ optimized implementation
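
To make the idea of search-history features concrete, the sketch below summarizes the current result set with a few distance statistics plus the iteration count. It only illustrates the feature-extraction step. The feature names and the function `history_features` are assumptions for this example, and the predictor that consumes the features is not shown, so the features used in hsnw_constructionDARTH.py may differ.

```python
import statistics


def history_features(result_distances, iteration):
    """Summarize the search state at one step (hypothetical feature set).

    result_distances: distances from the query to the current top-k candidates.
    iteration: number of search iterations performed so far.
    """
    return {
        "iteration": iteration,
        "min_dist": min(result_distances),
        "max_dist": max(result_distances),
        "mean_dist": statistics.fmean(result_distances),
        "std_dist": statistics.pstdev(result_distances),
    }


# Example: features after 12 iterations with three candidates in the result set.
feats = history_features([0.2, 0.4, 0.6], iteration=12)
```

A predictor would be called on such a feature dictionary every few iterations, with the spacing between calls governed by ipi and mpi.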
Description: PiP is a saturation-based early termination strategy that stops the search once the top-k result set stops changing.
Key Features:
- Tracks overlap between consecutive iterations: $\phi_{h,l}(q) = 100 \times \frac{|N_{h-1,l}(q) \cap N_{h,l}(q)|}{k}$
- Stops when $\phi_{h,l}(q) \ge \gamma$ for $\Delta$ consecutive iterations
- Parameters: pip_gamma (γ, saturation threshold), pip_delta (Δ, patience)

Files:
- hnsw_pip.py - Python implementation
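
The saturation rule above can be sketched in a few lines. The example computes the overlap percentage between two consecutive top-k snapshots and counts how many iterations in a row it stays at or above the threshold. The class name `SaturationStopper` and its interface are assumptions for illustration and are not the API of hnsw_pip.py.

```python
def saturation(prev_ids, curr_ids, k):
    """Overlap between consecutive top-k sets, as a percentage of k."""
    return 100.0 * len(set(prev_ids) & set(curr_ids)) / k


class SaturationStopper:
    """Hypothetical helper: signals a stop after pip_delta saturated iterations."""

    def __init__(self, k, pip_gamma, pip_delta):
        self.k = k
        self.gamma = pip_gamma   # saturation threshold, in percent
        self.delta = pip_delta   # patience, in consecutive iterations
        self.prev = None
        self.streak = 0

    def update(self, curr_ids):
        """Record the current top-k ids; return True when the search may stop."""
        if self.prev is not None:
            phi = saturation(self.prev, curr_ids, self.k)
            # Any iteration below the threshold resets the streak.
            self.streak = self.streak + 1 if phi >= self.gamma else 0
        self.prev = list(curr_ids)
        return self.streak >= self.delta


# Example: the result set changes once, then stays fixed for two iterations.
stopper = SaturationStopper(k=3, pip_gamma=100.0, pip_delta=2)
snapshots = [[5, 9, 2], [5, 9, 7], [5, 9, 7], [5, 9, 7]]
stops = [stopper.update(ids) for ids in snapshots]
```

In a real search loop the caller would break out as soon as `update` returns True instead of collecting every decision.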
Description: Ada-ef adapts the exploration factor (ef) to the estimated difficulty of each query, using the Fundamental Distributional assumption (FDL).
Key Features:
- Offline phase: Builds ef-estimation table from sampled queries
- Online phase: Estimates query difficulty and looks up appropriate ef
- Parameters: target_recall, adaef_bins, adaef_delta

Files:
- hnsw_adaef.py - Python implementation
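
The offline and online phases can be pictured as a binned lookup table. The sketch below assumes each sampled query has already been reduced to a scalar difficulty score and to the smallest ef that reached the target recall for it. Both quantities, the bin edges, and the function names are hypothetical, so this shows the table mechanics only and not the estimator used in hnsw_adaef.py.

```python
import bisect


def build_ef_table(samples, bin_edges, default_ef):
    """Offline phase (sketch): one ef value per difficulty bin.

    samples: (difficulty_score, ef_needed) pairs from sampled queries, where
             ef_needed is the smallest ef that met the recall target offline.
    bin_edges: sorted score boundaries; len(bin_edges) + 1 bins in total.
    Each bin keeps the largest ef_needed it saw (a conservative choice);
    bins without samples fall back to default_ef.
    """
    n_bins = len(bin_edges) + 1
    table = [default_ef] * n_bins
    seen = [False] * n_bins
    for score, ef_needed in samples:
        b = bisect.bisect_right(bin_edges, score)
        table[b] = max(table[b], ef_needed) if seen[b] else ef_needed
        seen[b] = True
    return table


def lookup_ef(table, bin_edges, score):
    """Online phase (sketch): map a query's difficulty score to an ef value."""
    return table[bisect.bisect_right(bin_edges, score)]


# Example with three bins: easy (< 0.3), medium (< 0.6), hard (>= 0.6).
bin_edges = [0.3, 0.6]
samples = [(0.1, 16), (0.2, 24), (0.5, 64), (0.9, 200)]
table = build_ef_table(samples, bin_edges, default_ef=128)
ef_for_query = lookup_ef(table, bin_edges, 0.45)
```

Taking the maximum per bin favors recall over speed; a percentile would trade some recall for lower average ef.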
The C++ implementations provide significant speedups for index construction and search. Before running any notebook or benchmark, you must build the C++ modules.
cd hnsw_cpp/src
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O3 -DNDEBUG -march=native -ffast-math" ..
cmake --build . -j
cd ../../..

cd hnswDarth_cpp/src
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O3 -DNDEBUG -march=native -ffast-math" ..
cmake --build . -j
cd ../../..

| Flag | Description |
|---|---|
| -O3 | Maximum optimization |
| -DNDEBUG | Disable assertions |
| -march=native | CPU-specific optimizations |
| -ffast-math | Fast floating-point math |
| -j | Parallel build (job count defaults to the native build tool's choice) |
approximate-nearest-neighbor-graphs/
├── hnsw_construction.py # HNSW Python implementation
├── hsnw_constructionDARTH.py # DARTH Python implementation
├── hnsw_pip.py # PiP Python implementation
├── hnsw_adaef.py # Ada-ef Python implementation
├── hnsw_cpp/ # HNSW C++ source
├── hnswDarth_cpp/ # DARTH C++ source
├── testing/ # Testing and comparison scripts
├── utils/ # Utility functions
├── metrics/ # Benchmarking metrics
├── notebooks/ # Jupyter notebooks
├── results_csv/ # CSV results
├── plot_results/ # Generated plots
├── Datasets/ # Dataset files
└── README.md # This file
- Python 3.8 or higher
- CMake 3.15+ (for C++ modules)
- C++ compiler with C++17 support
- Required libraries (see requirements.txt)
- Clone the repository:
  git clone https://github.com/your-username/approximate-nearest-neighbor-graphs.git
  cd approximate-nearest-neighbor-graphs
- Install Python dependencies:
  pip install -r requirements.txt
- Build C++ modules (see C++ Build Setup above)
- Download or prepare your dataset in Datasets/
jupyter notebook

Then navigate to notebooks/ and run the notebooks in order:
- 01_hnsw_baseline.ipynb - Baseline HNSW
- 02_darth.ipynb - DARTH evaluation
- 03_pip.ipynb - PiP evaluation
- 04_adaef.ipynb - Ada-ef evaluation
- 05_unified_comparison.ipynb - Unified comparison
- 06_query_difficulty.ipynb - Query difficulty analysis
# Run benchmarks
python testing/run_comparison.py
# Generate plots
python plot_creation_csv.py

The project uses ANN-benchmark format datasets:
- siftsmall: Small SIFT subset (default)
- sift: Full SIFT dataset
- gist: GIST descriptors
- glove: GloVe embeddings
Place datasets in Datasets/<dataset_name>/ with files:
- <name>_base.fvecs - Base vectors
- <name>_query.fvecs - Query vectors
- <name>_groundtruth.ivecs - Ground truth
Or use the download scripts in scripts/.
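
For reference, .fvecs and .ivecs files store each vector as a little-endian 32-bit integer dimension followed by that many 4-byte components (floats for .fvecs, integers for .ivecs). The reader below is a minimal standard-library sketch of that layout. The function name `read_vecs` and the example path are assumptions, not utilities shipped in this repository.

```python
import struct


def read_vecs(path, component="f"):
    """Read an .fvecs (component='f') or .ivecs (component='i') file.

    Returns a list of vectors, each a list of Python floats or ints.
    """
    vectors = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break                      # clean end of file
            if len(header) != 4:
                raise ValueError("truncated dimension header")
            (dim,) = struct.unpack("<i", header)
            payload = f.read(4 * dim)
            if len(payload) != 4 * dim:
                raise ValueError("truncated vector payload")
            vectors.append(list(struct.unpack(f"<{dim}{component}", payload)))
    return vectors


# Example usage (the path is illustrative):
# base = read_vecs("Datasets/siftsmall/siftsmall_base.fvecs", "f")
# truth = read_vecs("Datasets/siftsmall/siftsmall_groundtruth.ivecs", "i")
```

This version favors clarity; for the larger datasets a NumPy-based reader that loads the file in one call would be considerably faster.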
The following metrics are computed:
| Metric | Description |
|---|---|
| Recall@K | Fraction of true neighbors found in top-K |
| QPS | Queries per second |
| Latency | Time per query (ms) |
| Build Time | Index construction time |
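
The first three metrics in the table can be computed as in the sketch below. It assumes a search function that returns neighbor ids ordered by distance; the names `recall_at_k` and `benchmark` are hypothetical and do not refer to the code in metrics/.

```python
import time


def recall_at_k(found_ids, true_ids, k):
    """Fraction of the k true neighbors that appear in the top-k results."""
    return len(set(found_ids[:k]) & set(true_ids[:k])) / k


def benchmark(search_fn, queries):
    """Time search_fn over all queries.

    Returns (results, queries per second, mean latency in milliseconds).
    """
    start = time.perf_counter()
    results = [search_fn(q) for q in queries]
    elapsed = time.perf_counter() - start
    return results, len(queries) / elapsed, 1000.0 * elapsed / len(queries)


# Example with a toy brute-force "index" over the integers 0..999.
def toy_search(q):
    return sorted(range(1000), key=lambda i: abs(i - q))[:10]


results, qps, latency_ms = benchmark(toy_search, list(range(50)))
example_recall = recall_at_k([1, 2, 3, 4], [2, 3, 9, 8], k=4)
```

Because both QPS and latency are derived from a single wall-clock measurement over sequential queries, they are reciprocal here; measuring per-query times separately would also allow reporting percentiles.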
| Algorithm | Early Termination | Adaptive | Offline Phase | Best For |
|---|---|---|---|---|
| HNSW | No | No | No | Baseline, simple use cases |
| DARTH | Yes | Yes | Optional | Reducing effort on easy queries |
| PiP | Yes | No | No | Simple early stopping |
| Ada-ef | Yes | Yes | Yes | Consistent recall targets |
This thesis investigates:
- Trade-offs: How do different early termination strategies affect the recall-efficiency trade-off?
- Query Difficulty: Why are some queries inherently harder than others, and can we identify them?
- Adaptation: Can adaptive methods that adjust to query difficulty outperform fixed-parameter methods?
- Stability: Which methods provide consistent performance across different query types?
The C++ module hasn't been built. Run the build commands in the C++ Build Setup section above.
The DARTH C++ module hasn't been built. Run the DARTH build commands above.
- Reduce dataset size
- Use a machine with more RAM
- Process in batches
- Increase efSearch
- Adjust adaptive method parameters
- Check metric compatibility (some methods use cosine by default)
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License. See the LICENSE file for details.
This work is part of my thesis on Approximate Nearest Neighbor Search. Special thanks to:
- The original HNSW paper authors
- Authors of the DARTH, PiP, and Ada-ef papers
- The ANN-benchmark community for datasets