- Existing embeddings and retrieval results are reused
- Only new pairwise judgments are computed
- ELO ratings are recalculated from all judgments
Add to `config.yaml`:

```yaml
rerankers:
  - name: "your-reranker"
    type: "your-type"
    model: "model-name"
    api_key_env: "YOUR_API_KEY"
    top_k: 15
```

Export your API key and change into the pipeline directory:

```bash
export YOUR_API_KEY="your-key"
cd eval-pipeline
```
Generate reranked results for every dataset (reusing the existing embeddings and retrieval results):

```bash
for dataset in msmarco arguana fiqa_small business-reports pg dbpedia scifact; do
  python add-reranker.py --dataset $dataset --reranker-name "your-reranker" --skip-evaluate
done
```

Run pairwise comparisons against the existing rerankers:

```bash
for dataset in msmarco arguana fiqa_small business-reports pg dbpedia scifact; do
  python compare-rerankers.py --dataset $dataset
done
```

Aggregate everything into the final results file:

```bash
python aggregate-all-results.py
```

Results are saved to benchmarks.json.
Verify the run:

```bash
# Check reranked files
ls runs/*/*/rerank/reranked_your-reranker.jsonl

# Check ELO ratings
cat runs/msmarco/*/llm_judge/elo_leaderboard.csv

# Check final results
cat benchmarks.json | jq '.[] | select(.name == "your-reranker")'
```

Reused:
- Embeddings (BAAI/bge-small-en-v1.5)
- Retrieval results (top-50 per query)
- Existing pairwise judgments
New:
- Your reranker's results
- Pairwise judgments: your-reranker vs each existing reranker
Recalculated:
- ELO ratings from all judgments
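Recalculating ELO from all judgments can be sketched with a standard Elo update applied over the full judgment history. This is a minimal illustration, not the pipeline's actual implementation: the function name, K-factor, and base rating are assumptions.

```python
from collections import defaultdict


def elo_ratings(judgments, k=32, base=1000.0):
    """Recompute ELO ratings from scratch over all pairwise judgments.

    judgments: list of (winner, loser) reranker-name pairs, e.g. one
    entry per LLM-judge comparison. Standard Elo update with a fixed
    K-factor; parameters here are illustrative assumptions.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in judgments:
        # Expected score of the winner under the Elo model.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        delta = k * (1.0 - expected)
        ratings[winner] += delta  # winner gains
        ratings[loser] -= delta   # loser loses the same amount (zero-sum)
    return dict(ratings)


# Example: "a" dominates, "c" loses to everyone.
leaderboard = elo_ratings([("a", "b"), ("a", "b"), ("b", "c"), ("a", "c")])
```

Because the ratings are rebuilt from the complete judgment set each time, adding a new reranker's judgments automatically repositions every existing reranker on the leaderboard.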
- 7 datasets: msmarco, arguana, fiqa_small, business-reports, pg, dbpedia, scifact
- LLM judge: Azure OpenAI GPT-5
- 50 queries per dataset
- ~550 new judgments per dataset (11 existing rerankers)
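The judgment-count estimate above follows directly from the comparison scheme: each of the 50 queries yields one pairwise judgment between the new reranker and each of the 11 existing rerankers.

```python
# Back-of-the-envelope check of the judgment counts stated above.
existing_rerankers = 11
queries_per_dataset = 50
datasets = 7

judgments_per_dataset = existing_rerankers * queries_per_dataset
print(judgments_per_dataset)             # 550 new judgments per dataset
print(judgments_per_dataset * datasets)  # 3850 new judgments across all 7 datasets
```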