Approximate Nearest Neighbor Graphs

Overview

This repository contains implementations and benchmarks for my thesis on Approximate Nearest Neighbor Search (ANNS). ANN graphs are used to efficiently find approximate nearest neighbors in high-dimensional spaces, making them useful for applications like recommendation systems, image retrieval, clustering, and NLP.

Implemented Algorithms

1. HNSW (Hierarchical Navigable Small World)

Description: The foundational ANN graph algorithm using a multi-layer hierarchical structure with small-world navigation.

Key Features:

Multi-layer construction with probability-based level assignment
Greedy search on upper layers, beam search on bottom layer
Parameters: M (connections per node), efConstruction, efSearch

Files:

hnsw_construction.py - Pure Python implementation
hnsw_cpp/ - C++ optimized implementation

2. DARTH (Dynamic Adaptive Re-termination)

Description: Early termination strategy using history features to predict when search can safely stop.

Key Features:

Monitors search history features (distance statistics, iteration count)
Uses a predictor to estimate termination confidence
Parameters: Rt (re-termination threshold), ipi (initial prediction interval), mpi (minimum prediction interval)

Files:

hsnw_constructionDARTH.py - Python implementation
hnswDarth_cpp/ - C++ optimized implementation

3. PiP (Patience in Proximity)

Description: Saturation-based early termination that monitors when the top-k result set stops changing.

Key Features:

Tracks overlap between consecutive iterations: [φ_h,l(q) = 100 × |N_{h-1,l}(q) ∩ N_{h,l}(q)| / k]
Stops when [φ \ge \gamma ] for Δ consecutive iterations
Parameters: pip_gamma (γ, saturation threshold), pip_delta (Δ, patience)

Files:

hnsw_pip.py - Python implementation

4. Ada-ef (Adaptive Exploration Factor)

Description: Adapts the exploration factor based on estimated query difficulty using the Fundamental Distributional assumption (FDL).

Key Features:

Offline phase: Builds ef-estimation table from sampled queries
Online phase: Estimates query difficulty and looks up appropriate ef
Parameters: target_recall, adaef_bins, adaef_delta

Files:

hnsw_adaef.py - Python implementation

C++ Build Setup

The C++ implementations provide significant speedups for index construction and search. Before running any notebook or benchmark, you must build the C++ modules.

Build Instructions

HNSW C++ Module

cd hnsw_cpp/src
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O3 -DNDEBUG -march=native -ffast-math" ..
cmake --build . -j
cd ../../..

DARTH C++ Module

cd hnswDarth_cpp/src
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O3 -DNDEBUG -march=native -ffast-math" ..
cmake --build . -j
cd ../../..

Build Options

Flag	Description
`-O3`	Maximum optimization
`-DNDEBUG`	Disable assertions
`-march=native`	CPU-specific optimizations
`-ffast-math`	Fast floating-point math
`-j`	Parallel build (uses all cores)

Project Structure

approximate-nearest-neighbor-graphs/
├── hnsw_construction.py      # HNSW Python implementation
├── hsnw_constructionDARTH.py # DARTH Python implementation
├── hnsw_pip.py               # PiP Python implementation
├── hnsw_adaef.py             # Ada-ef Python implementation
├── hnsw_cpp/                 # HNSW C++ source
├── hnswDarth_cpp/            # DARTH C++ source
├── testing/                  # Testing and comparison scripts
├── utils/                    # Utility functions
├── metrics/                  # Benchmarking metrics
├── notebooks/                # Jupyter notebooks
├── results_csv/              # CSV results
├── plot_results/             # Generated plots
├── Datasets/                 # Dataset files
└── README.md                # This file

Getting Started

Prerequisites

Python 3.8 or higher
CMake 3.15+ (for C++ modules)
C++ compiler with C++17 support
Required libraries (see requirements.txt)

Installation

Clone the repository:

git clone https://github.com/your-username/approximate-nearest-neighbor-graphs.git
cd approximate-nearest-neighbor-graphs

Install Python dependencies:
```
pip install -r requirements.txt
```
Build C++ modules (see C++ Build Setup above)
Download or prepare your dataset in Datasets/

Running Experiments

Using Jupyter Notebooks (Recommended)

jupyter notebook

Then navigate to notebooks/ and run the notebooks in order:

01_hnsw_baseline.ipynb - Baseline HNSW
02_darth.ipynb - DARTH evaluation
03_pip.ipynb - PiP evaluation
04_adaef.ipynb - Ada-ef evaluation
05_unified_comparison.ipynb - Unified comparison
06_query_difficulty.ipynb - Query difficulty analysis

Using Python Scripts

# Run benchmarks
python testing/run_comparison.py

# Generate plots
python plot_creation_csv.py

Datasets

The project uses ANN-benchmark format datasets:

siftsmall: Small SIFT subset (default)
sift: Full SIFT dataset
gist: GIST descriptors
glove: GloVe embeddings

Place datasets in Datasets/<dataset_name>/ with files:

<name>_base.fvecs - Base vectors
<name>_query.fvecs - Query vectors
<name>_groundtruth.ivecs - Ground truth

Or use the download scripts in scripts/.

Benchmark Metrics

The following metrics are computed:

Metric	Description
Recall@K	Fraction of true neighbors found in top-K
QPS	Queries per second
Latency	Time per query (ms)
Build Time	Index construction time

Algorithm Comparison

Algorithm	Early Termination	Adaptive	Offline Phase	Best For
HNSW	No	No	No	Baseline, simple use cases
DARTH	Yes	Yes	Optional	Reducing effort on easy queries
PiP	Yes	No	No	Simple early stopping
Ada-ef	Yes	Yes	Yes	Consistent recall targets

Key Research Questions

This thesis investigates:

Trade-offs: How do different early termination strategies affect the recall-efficiency trade-off?
Query Difficulty: Why are some queries inherently harder than others, and can we identify them?
Adaptation: Can adaptive methods that adjust to query difficulty outperform fixed-parameter methods?
Stability: Which methods provide consistent performance across different query types?

Troubleshooting

ImportError: No module named 'hnsw_cpp'

The C++ module hasn't been built. Run the build commands in the C++ Build Setup section above.

ImportError: No module named 'hnswDarth_cpp'

The DARTH C++ module hasn't been built. Run the DARTH build commands above.

Out of Memory

Reduce dataset size
Use a machine with more RAM
Process in batches

Low Recall

Increase efSearch
Adjust adaptive method parameters
Check metric compatibility (some methods use cosine by default)

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Add tests if applicable
Submit a pull request

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

This work is part of my thesis on Approximate Nearest Neighbor Search. Special thanks to:

The original HNSW paper authors
Authors of the DARTH, PiP, and Ada-ef papers
The ANN-benchmark community for datasets

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
Datasets/siftsmall		Datasets/siftsmall
HNSWDARTH_comparison_plots		HNSWDARTH_comparison_plots
benchmark		benchmark
benchmark_results		benchmark_results
darth		darth
demos		demos
experiments		experiments
hnswDarth_cpp		hnswDarth_cpp
hnsw_adaef_cpp		hnsw_adaef_cpp
hnsw_cpp		hnsw_cpp
hnsw_pip_cpp		hnsw_pip_cpp
metrics		metrics
milestone_report		milestone_report
notebooks		notebooks
papers		papers
plot_fashion_mnist		plot_fashion_mnist
plot_glove100		plot_glove100
plot_msmarco_text		plot_msmarco_text
plot_results		plot_results
plot_sift128		plot_sift128
plot_siftsmall		plot_siftsmall
predictor_models		predictor_models
real-world-case-scenario		real-world-case-scenario
report		report
results_csv		results_csv
scripts		scripts
testing		testing
utils		utils
.DS_Store		.DS_Store
.gitignore		.gitignore
DARTH_PREDICTOR_GUIDE.md		DARTH_PREDICTOR_GUIDE.md
DATASETS.md		DATASETS.md
Dockerfile		Dockerfile
LICENSE		LICENSE
NOTES_HNSW_PAPER.md		NOTES_HNSW_PAPER.md
PLOTS_GUIDE.md		PLOTS_GUIDE.md
README.md		README.md
__init__.py		__init__.py
hnsw_adaef.py		hnsw_adaef.py
hnsw_construction.py		hnsw_construction.py
hnsw_pip.py		hnsw_pip.py
hsnw_constructionDARTH.py		hsnw_constructionDARTH.py
plot_creation_csv.py		plot_creation_csv.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Approximate Nearest Neighbor Graphs

Overview

Implemented Algorithms

1. HNSW (Hierarchical Navigable Small World)

2. DARTH (Dynamic Adaptive Re-termination)

3. PiP (Patience in Proximity)

4. Ada-ef (Adaptive Exploration Factor)

C++ Build Setup

Build Instructions

HNSW C++ Module

DARTH C++ Module

Build Options

Project Structure

Getting Started

Prerequisites

Installation

Running Experiments

Using Jupyter Notebooks (Recommended)

Using Python Scripts

Datasets

Benchmark Metrics

Algorithm Comparison

Key Research Questions

Troubleshooting

ImportError: No module named 'hnsw_cpp'

ImportError: No module named 'hnswDarth_cpp'

Out of Memory

Low Recall

Contributing

License

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages