TL;DR: We benchmark six nearest-neighbor search engines (Annoy, Elasticsearch, FAISS, HNSWlib, ScaNN, scikit-learn), each paired with TF-IDF and CodeBERT embeddings, on the BigCloneBench dataset. Key findings from the reported experiments: FAISS scales best to large indexes, CodeBERT gives the highest semantic accuracy, and Elasticsearch has the lowest raw query latency.
This repository accompanies the Springer TLDKS LVII paper "Evaluation of Code Similarity Search Strategies in Large-Scale Codebases" and provides reproducible scripts for benchmarking source-code retrieval pipelines at scale. We evaluate lexical and semantic representations (TF-IDF and CodeBERT) combined with multiple ANN/vector-search engines and report indexing cost, query efficiency, and practical trade-offs for large software corpora.
Raw JSONL ──► Vectorization ──► Index Construction ──► Query Benchmarking ──► Results & Plots
              (TF-IDF /         (FAISS / Annoy /       (latency, throughput,
               CodeBERT)         HNSW / SKLNN /         relevance metrics)
                                 Elasticsearch /
                                 ScaNN)
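The first three pipeline stages can be illustrated with a minimal, standard-library-only sketch. It is not the repository's implementation: the function names (`tokenize`, `fit_tfidf`, `embed_query`, `search`) are hypothetical, the tokenizer is a simple regex, and an exhaustive cosine scan stands in for the real index backends. The IDF uses the common smoothed form `log((1 + n) / (1 + df)) + 1`.

```python
import math
import re
from collections import Counter


def tokenize(code):
    # Hypothetical tokenizer: identifiers, integer literals, single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def normalise(vec):
    # Scale a sparse vector (dict token -> weight) to unit L2 length.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def fit_tfidf(snippets):
    """Vectorization stage: return (idf, vectors) for a list of code snippets."""
    docs = [Counter(tokenize(s)) for s in snippets]
    n = len(docs)
    # Document frequency: iterating a Counter yields each distinct token once.
    df = Counter(tok for doc in docs for tok in doc)
    idf = {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}
    vectors = [normalise({t: c * idf[t] for t, c in doc.items()}) for doc in docs]
    return idf, vectors


def embed_query(query, idf):
    # Tokens never seen at fit time carry no weight and are dropped.
    counts = Counter(tokenize(query))
    return normalise({t: c * idf[t] for t, c in counts.items() if t in idf})


def search(query_vec, vectors, k=3):
    """Query stage: exhaustive cosine scan, returns [(score, index), ...]."""
    scores = [
        (sum(w * vec.get(t, 0.0) for t, w in query_vec.items()), i)
        for i, vec in enumerate(vectors)
    ]
    return sorted(scores, key=lambda s: (-s[0], s[1]))[:k]
```

Here the "index" is just the list of vectors. An ANN backend would replace `search` with a structure that avoids scanning every vector, trading some exactness for speed.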
| Method | Type | Description |
|---|---|---|
| Annoy | Tree-based | Approximate nearest-neighbor search with tree-based partitioning. |
| Elasticsearch | Inverted Index | Vector and text-based search using tunable scoring. |
| FAISS | Clustering/Quantization | Facebook AI Similarity Search; efficient high-dimensional search. |
| HNSW | Graph-based | Hierarchical Navigable Small World graphs. |
| ScaNN | Quantization | Google's Scalable Nearest Neighbors with partitioning. |
| SKLNN | Brute/Tree | scikit-learn nearest-neighbor algorithms. |
| Finding | Summary |
|---|---|
| Semantic quality | CodeBERT improves semantic relevance versus TF-IDF across backends. |
| Query speed | Elasticsearch provides the best raw query latency in the reported experiments. |
| Scalability | FAISS offers the strongest scalability profile for very large indexes. |
| Simplicity | scikit-learn is a strong baseline for small/medium datasets. |
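Latency and throughput figures of the kind summarized above can be collected with a small timing harness. The sketch below is an illustration only: `time_queries` is a hypothetical name (not the repository's `benchmark_search`), and the percentile is a simple index into the sorted samples rather than an interpolated estimate.

```python
import statistics
import time


def time_queries(search_fn, queries, trials=100):
    """Call search_fn `trials` times, cycling through `queries`, and
    return mean latency, an approximate 95th percentile, and queries/second."""
    latencies = []
    for i in range(trials):
        query = queries[i % len(queries)]
        start = time.perf_counter()  # monotonic, high-resolution clock
        search_fn(query)
        latencies.append(time.perf_counter() - start)

    total = sum(latencies)
    ordered = sorted(latencies)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "mean_ms": 1000 * statistics.mean(latencies),
        "p95_ms": 1000 * ordered[p95_index],
        "qps": trials / total if total > 0 else float("inf"),
    }
```

Timing only the call to `search_fn` keeps query preparation and result handling out of the measurement. Running a few untimed warm-up queries first is a common precaution when a backend loads or caches data lazily.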
.
├── .github/workflows/ci.yml # CI smoke tests and optional pytest
├── all.py # End-to-end benchmark + plots
├── indexing.py # Indexing-time benchmark
├── performance.py # Latency/QPS benchmark
├── plots.py # Benchmark plotting utility
├── testcodebert.py # TF-IDF vs CodeBERT benchmark
├── examples/quickstart.py # Self-contained quickstart demo
├── src/similarity_search/utils.py # Shared benchmark/data helpers
├── tests/test_utils.py # Unit tests for shared helpers
├── docs/ARCHITECTURE.md # System architecture details
├── docs/RESULTS.md # Reproducible results summary
├── requirements.txt # Reproducible runtime dependencies
├── pyproject.toml # Modern packaging metadata
├── CONTRIBUTING.md # Contribution guide
└── CHANGELOG.md # Release history
python -m pip install -r requirements.txt
python examples/quickstart.py
- Python 3.8+
- BigCloneBench JSONL data (CodeXGLUE format)
- Optional: Elasticsearch 8.x for Elasticsearch experiments
pip install -e .
# or
pip install -r requirements.txt
- Download BigCloneBench from CodeXGLUE.
- Place the file at data/data.jsonl.
- Ensure each line is a JSON object with a "func" field.
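The expected input format can be checked with a short loader. This is a sketch under the assumptions stated above (one JSON object per line, source code stored under the `func` key). `load_snippets` and its `limit` parameter are hypothetical names, not the repository's `read_code_snippets` helper.

```python
import json


def load_snippets(path, limit=None):
    """Read code snippets from a JSONL file, one JSON object per line.

    Blank lines and records without a non-empty "func" value are skipped.
    """
    snippets = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            code = record.get("func")
            if code:
                snippets.append(code)
            # Stop early once enough snippets have been collected.
            if limit is not None and len(snippets) >= limit:
                break
    return snippets
```

A call such as `load_snippets("data/data.jsonl", limit=1000)` makes it easy to try a pipeline on a small slice before running it on the full corpus.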
python all.py --data data/data.jsonl --k 3 --trials 100
python indexing.py --data data/data.jsonl --sizes 100 1000 10000
python performance.py --data data/data.jsonl --query "for(int i=0;i<n;i++){sum+=i;}"
python plots.py --data data/data.jsonl --time-plot search_time_comparison.tex
python testcodebert.py --data data/data.jsonl --trials 100
docker run -p 9200:9200 -e discovery.type=single-node -e xpack.security.enabled=false docker.elastic.co/elasticsearch/elasticsearch:8.14.0
For authenticated setups, copy .env.example to .env and configure ES_HOST, ES_USER, and ES_PASSWORD.
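One way to turn these variables into connection settings is sketched below. It assumes the variables are already present in the process environment (how the scripts load `.env` is not specified here). `es_settings` is a hypothetical helper, and the default host simply matches the port published by the `docker run` command above.

```python
import os


def es_settings(env=None):
    """Build Elasticsearch connection settings from ES_HOST, ES_USER, ES_PASSWORD.

    The keys mirror the `hosts` and `basic_auth` keyword arguments of the
    8.x Python client, so the result could be passed as Elasticsearch(**settings).
    """
    env = os.environ if env is None else env
    # Assumed default: the unauthenticated single-node container on port 9200.
    settings = {"hosts": [env.get("ES_HOST", "http://localhost:9200")]}
    user, password = env.get("ES_USER"), env.get("ES_PASSWORD")
    if user and password:
        # Only add credentials when both values are configured.
        settings["basic_auth"] = (user, password)
    return settings
```

Leaving credentials out when they are unset lets the same code serve both the security-disabled container and an authenticated cluster.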
- Add backend-specific index/search functions in benchmark scripts.
- Reuse read_code_snippets and benchmark_search from src/similarity_search/utils.py.
- Add reproducible output and update docs/results tables.
- Include tests for any shared utilities.
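As a rough illustration of the first step, a new backend can be expressed as a pair of build and search functions behind a common signature. Everything in this sketch is hypothetical (the `BACKENDS` registry, the `register` helper, and the function signatures). The benchmark scripts may organize backends differently, and an exact brute-force scan stands in for a real engine.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
BuildFn = Callable[[List[Vector]], object]
SearchFn = Callable[[object, Vector, int], List[int]]

# Hypothetical registry mapping a backend name to its (build, search) pair.
BACKENDS: Dict[str, Tuple[BuildFn, SearchFn]] = {}


def register(name: str, build: BuildFn, search: SearchFn) -> None:
    BACKENDS[name] = (build, search)


def build_bruteforce(vectors: List[Vector]) -> object:
    # The "index" is a plain copy of the vectors; nothing is precomputed.
    return [list(v) for v in vectors]


def search_bruteforce(index: object, query: Vector, k: int) -> List[int]:
    # Exact search: rank every stored vector by squared Euclidean distance.
    def sqdist(v: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(v, query))

    return sorted(range(len(index)), key=lambda i: sqdist(index[i]))[:k]


register("bruteforce", build_bruteforce, search_bruteforce)
```

Keeping every backend behind the same two-function shape means a timing loop can iterate over `BACKENDS.items()` without backend-specific branches, and an exact backend like this one doubles as ground truth when checking the recall of approximate engines.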
@incollection{martinez2024evaluation,
title={Evaluation of Code Similarity Search Strategies in Large-Scale Codebases},
author={Martinez-Gil, Jorge and Yin, Shaoyi},
booktitle={Transactions on Large-Scale Data- and Knowledge-Centered Systems LVII},
pages={99--113},
year={2024},
publisher={Springer}
}
See CONTRIBUTING.md.
MIT. See LICENSE.