# Reranker Leaderboard

Evaluation of 12 reranking models using LLM-as-judge pairwise comparisons across 6 datasets.

## Leaderboard

| Rank | Model | ELO | Win Rate |
|------|-------|-----|----------|
| 1 | Zerank 2 | 1638 | 57% |
| 2 | Cohere Rerank 4 Pro | 1629 | 58% |
| 3 | Zerank 1 | 1573 | 57% |
| 4 | Voyage AI Rerank 2.5 | 1544 | 58% |
| 5 | Zerank 1 Small | 1539 | 55% |
| 6 | Voyage AI Rerank 2.5 Lite | 1520 | 53% |
| 7 | Cohere Rerank 4 Fast | 1510 | 50% |
| 8 | Qwen3 Reranker 8B | 1473 | 51% |
| 9 | Contextual AI Rerank v2 Instruct | 1469 | 42% |
| 10 | Cohere Rerank 3.5 | 1451 | 41% |
| 11 | BAAI/BGE Reranker v2 M3 | 1327 | 29% |
| 12 | Jina Reranker v2 Base Multilingual | 1327 | 28% |
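Note that ELO rank and raw win rate need not agree (e.g. ranks 1 and 2 above): ELO weights *who* a model beats, not just how often it wins. Under the standard Elo model, the expected win probability between two rated models is a logistic function of their rating gap. A minimal sketch, assuming the conventional 400-point scale:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Rating gap between the top and bottom of the table above:
p = expected_score(1638, 1327)  # ~0.86: the top model is expected to win ~86% of pairs
```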

## Datasets

- MSMARCO (web search)
- ArguAna (argument mining)
- FiQA (financial Q&A)
- Business Reports
- Paul Graham Essays
- DBPedia (entity retrieval)

## Methodology

1. Embed documents with `BAAI/bge-small-en-v1.5`
2. Retrieve the top 50 candidates using FAISS
3. Rerank to the top 15 with each model
4. Generate pairwise judgments using GPT-5
5. Calculate ELO ratings from the pairwise comparisons
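Step 5 can be sketched as a standard sequential Elo update over the judged pairs. This is a minimal illustration, not the project's actual implementation; the K-factor, initial rating, and `(winner, loser)` record format are assumptions:

```python
def compute_elo(comparisons, k=32, initial=1500):
    """Sequential Elo update from (winner, loser) pairwise judgments.

    comparisons: iterable of (winner_name, loser_name) tuples,
                 one per LLM-as-judge decision.
    Returns a dict mapping model name -> final rating.
    """
    ratings = {}
    for winner, loser in comparisons:
        ra = ratings.setdefault(winner, initial)
        rb = ratings.setdefault(loser, initial)
        # Expected score of the winner under the logistic Elo model
        expected = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
        # The winner gains what the loser gives up, scaled by surprise
        ratings[winner] = ra + k * (1.0 - expected)
        ratings[loser] = rb - k * (1.0 - expected)
    return ratings

elo = compute_elo([("model_a", "model_b"),
                   ("model_a", "model_c"),
                   ("model_b", "model_c")])
```

Because the winner's gain equals the loser's loss, the mean rating stays fixed at the initial value; in practice results are often averaged over many random orderings of the comparisons, since sequential Elo is order-dependent.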

## Usage

See `eval-pipeline/ADD_NEW_RERANKER_GUIDE.md` for instructions on adding new rerankers.

## Project Structure

```
eval-pipeline/
├── config.yaml              Configuration
├── pipeline/                Evaluation pipeline
│   └── stages/
│       ├── embed.py         Document embedding
│       ├── retrieve.py      FAISS retrieval
│       ├── rerank.py        Reranker integrations
│       └── llm-judge.py     LLM-as-judge evaluation
├── add-reranker.py          Add new reranker
├── compare-rerankers.py     Compare all rerankers (ELO)
└── aggregate-all-results.py Aggregate cross-dataset results
```