Evaluation of Code Similarity Search Strategies in Large-Scale Codebases

TL;DR — We systematically benchmark 6 approximate nearest-neighbour engines (Annoy, Elasticsearch, FAISS, HNSWlib, ScaNN, scikit-learn) paired with both TF-IDF and CodeBERT embeddings on the BigCloneBench dataset. Key finding: FAISS offers the best scalability; CodeBERT gives the highest semantic accuracy; Elasticsearch leads on raw query speed.

📖 Abstract

This repository accompanies the Springer TLDKS LVII paper Evaluation of Code Similarity Search Strategies in Large-Scale Codebases and provides reproducible scripts for benchmarking source-code retrieval pipelines at scale. We evaluate lexical and semantic representations (TF-IDF and CodeBERT) combined with multiple ANN/vector-search engines and report indexing cost, query efficiency, and practical trade-offs for large software corpora.

🏗️ Architecture

Raw JSONL ──► Vectorization ──► Index Construction ──► Query Benchmarking ──► Results & Plots
                (TF-IDF /           (FAISS / Annoy /        (latency, throughput,
                CodeBERT)           HNSW / SKLNN /           relevance metrics)
                                    Elasticsearch /
                                    ScaNN)

🛠️ Methods Evaluated

Method	Type	Description
Annoy	Tree-based	Approximate nearest-neighbor search with tree-based partitioning.
Elasticsearch	Inverted Index	Vector and text-based search using tunable scoring.
FAISS	Clustering/Quantization	Facebook AI Similarity Search; efficient high-dimensional search.
HNSW	Graph-based	Hierarchical Navigable Small World graphs.
ScaNN	Quantization	Google's Scalable Nearest Neighbors with partitioning.
SKLNN	Brute/Tree	scikit-learn nearest-neighbor algorithms.

📊 Key Results

Finding	Summary
Semantic quality	CodeBERT improves semantic relevance versus TF-IDF across backends.
Query speed	Elasticsearch provides the best raw query latency in the reported experiments.
Scalability	FAISS offers the strongest scalability profile for very large indexes.
Simplicity	scikit-learn is a strong baseline for small/medium datasets.

📂 Repository Structure

.
├── .github/workflows/ci.yml            # CI smoke tests and optional pytest
├── all.py                              # End-to-end benchmark + plots
├── indexing.py                         # Indexing-time benchmark
├── performance.py                      # Latency/QPS benchmark
├── plots.py                            # Benchmark plotting utility
├── testcodebert.py                     # TF-IDF vs CodeBERT benchmark
├── examples/quickstart.py              # Self-contained quickstart demo
├── src/similarity_search/utils.py      # Shared benchmark/data helpers
├── tests/test_utils.py                 # Unit tests for shared helpers
├── docs/ARCHITECTURE.md                # System architecture details
├── docs/RESULTS.md                     # Reproducible results summary
├── requirements.txt                    # Reproducible runtime dependencies
├── pyproject.toml                      # Modern packaging metadata
├── CONTRIBUTING.md                     # Contribution guide
└── CHANGELOG.md                        # Release history

⚡ Quick Start

python -m pip install -r requirements.txt
python examples/quickstart.py

🚀 Full Reproducibility Guide

Prerequisites

Python 3.8+
BigCloneBench JSONL data (CodeXGLUE format)
Optional: Elasticsearch 8.x for Elasticsearch experiments

Installation

pip install -e .
# or
pip install -r requirements.txt

Dataset Setup

Download BigCloneBench from CodeXGLUE.
Place the file at data/data.jsonl.
Ensure each line has JSON with a func field.

Running Experiments

python all.py --data data/data.jsonl --k 3 --trials 100
python indexing.py --data data/data.jsonl --sizes 100 1000 10000
python performance.py --data data/data.jsonl --query "for(int i=0;i<n;i++){sum+=i;}"
python plots.py --data data/data.jsonl --time-plot search_time_comparison.tex
python testcodebert.py --data data/data.jsonl --trials 100

Elasticsearch Setup

docker run -p 9200:9200 -e discovery.type=single-node -e xpack.security.enabled=false docker.elastic.co/elasticsearch/elasticsearch:8.14.0

For authenticated setups, copy .env.example to .env and configure ES_HOST, ES_USER, and ES_PASSWORD.

🔬 Extending the Benchmark

Add backend-specific index/search functions in benchmark scripts.
Reuse read_code_snippets and benchmark_search from src/similarity_search/utils.py.
Add reproducible output and update docs/results tables.
Include tests for any shared utilities.

📝 Citation

@incollection{martinez2024evaluation,
  title={Evaluation of Code Similarity Search Strategies in Large-Scale Codebases},
  author={Martinez-Gil, Jorge and Yin, Shaoyi},
  booktitle={Transactions on Large-Scale Data-and Knowledge-Centered Systems LVII},
  pages={99--113},
  year={2024},
  publisher={Springer}
}

🤝 Contributing

See CONTRIBUTING.md.

📄 License

MIT. See LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Evaluation of Code Similarity Search Strategies in Large-Scale Codebases

📖 Abstract

🏗️ Architecture

🛠️ Methods Evaluated

📊 Key Results

📂 Repository Structure

⚡ Quick Start

🚀 Full Reproducibility Guide

Prerequisites

Installation

Dataset Setup

Running Experiments

Elasticsearch Setup

🔬 Extending the Benchmark

📝 Citation

🤝 Contributing

📄 License

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
individual		individual
src/similarity_search		src/similarity_search
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
all.py		all.py
indexing.py		indexing.py
performance.py		performance.py
plots.py		plots.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
testcodebert.py		testcodebert.py

Folders and files

Latest commit

History

Repository files navigation

Evaluation of Code Similarity Search Strategies in Large-Scale Codebases

📖 Abstract

🏗️ Architecture

🛠️ Methods Evaluated

📊 Key Results

📂 Repository Structure

⚡ Quick Start

🚀 Full Reproducibility Guide

Prerequisites

Installation

Dataset Setup

Running Experiments

Elasticsearch Setup

🔬 Extending the Benchmark

📝 Citation

🤝 Contributing

📄 License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages