# Reranker Leaderboard

Evaluation of 12 reranking models using LLM-as-judge pairwise comparisons across 6 datasets.

## Leaderboard

| Rank | Model | ELO | Win Rate |
|------|-------|-----|----------|
| 1 | Zerank 2 | 1638 | 57% |
| 2 | Cohere Rerank 4 Pro | 1629 | 58% |
| 3 | Zerank 1 | 1573 | 57% |
| 4 | Voyage AI Rerank 2.5 | 1544 | 58% |
| 5 | Zerank 1 Small | 1539 | 55% |
| 6 | Voyage AI Rerank 2.5 Lite | 1520 | 53% |
| 7 | Cohere Rerank 4 Fast | 1510 | 50% |
| 8 | Qwen3 Reranker 8B | 1473 | 51% |
| 9 | Contextual AI Rerank v2 Instruct | 1469 | 42% |
| 10 | Cohere Rerank 3.5 | 1451 | 41% |
| 11 | BAAI/BGE Reranker v2 M3 | 1327 | 29% |
| 12 | Jina Reranker v2 Base Multilingual | 1327 | 28% |
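Note that ELO rank and raw win rate need not agree (e.g. ranks 1 and 2 above): ELO weights *who* a model beats, not just how often it wins. Under the standard Elo model, the expected win probability between two rated models is a logistic function of their rating gap. A minimal sketch, assuming the conventional 400-point scale:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Rating gap between the top and bottom of the table above:
p = expected_score(1638, 1327)  # ~0.86: the top model is expected to win ~86% of pairs
```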

## Datasets

- MSMARCO (web search)
- ArguAna (argument mining)
- FiQA (financial Q&A)
- Business Reports
- Paul Graham Essays
- DBPedia (entity retrieval)

## Methodology

1. Embed documents with `BAAI/bge-small-en-v1.5`
2. Retrieve the top 50 candidates using FAISS
3. Rerank to the top 15 with each model
4. Generate pairwise judgments using GPT-5
5. Calculate ELO ratings from the pairwise comparisons
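Step 5 can be sketched as a standard sequential Elo update over the judged pairs. This is a minimal illustration, not the project's actual implementation; the K-factor, initial rating, and `(winner, loser)` record format are assumptions:

```python
def compute_elo(comparisons, k=32, initial=1500):
    """Sequential Elo update from (winner, loser) pairwise judgments.

    comparisons: iterable of (winner_name, loser_name) tuples,
                 one per LLM-as-judge decision.
    Returns a dict mapping model name -> final rating.
    """
    ratings = {}
    for winner, loser in comparisons:
        ra = ratings.setdefault(winner, initial)
        rb = ratings.setdefault(loser, initial)
        # Expected score of the winner under the logistic Elo model
        expected = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
        # The winner gains what the loser gives up, scaled by surprise
        ratings[winner] = ra + k * (1.0 - expected)
        ratings[loser] = rb - k * (1.0 - expected)
    return ratings

elo = compute_elo([("model_a", "model_b"),
                   ("model_a", "model_c"),
                   ("model_b", "model_c")])
```

Because the winner's gain equals the loser's loss, the mean rating stays fixed at the initial value; in practice results are often averaged over many random orderings of the comparisons, since sequential Elo is order-dependent.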

## Usage

See `eval-pipeline/ADD_NEW_RERANKER_GUIDE.md` for instructions on adding new rerankers.

## Project Structure

```
eval-pipeline/
├── config.yaml              Configuration
├── pipeline/                Evaluation pipeline
│   └── stages/
│       ├── embed.py         Document embedding
│       ├── retrieve.py      FAISS retrieval
│       ├── rerank.py        Reranker integrations
│       └── llm-judge.py     LLM-as-judge evaluation
├── add-reranker.py          Add new reranker
├── compare-rerankers.py     Compare all rerankers (ELO)
└── aggregate-all-results.py Aggregate cross-dataset results
```