
Adding a New Reranker

Prerequisites

  • Existing embeddings and retrieval results are reused
  • Only new pairwise judgments are computed
  • ELO ratings are recalculated from all judgments

Steps

1. Update Configuration

Add to config.yaml:

rerankers:
  - name: "your-reranker"
    type: "your-type"
    model: "model-name"
    api_key_env: "YOUR_API_KEY"
    top_k: 15
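
As a quick sanity check before running the pipeline, you can validate the new entry's shape. This is a hypothetical helper, not part of the pipeline; the required field names are taken from the example above, and the actual schema may enforce more than this:

```python
# Hypothetical sanity check for a new reranker entry in config.yaml.
# Field names mirror the example above; the pipeline's real schema may differ.
REQUIRED_FIELDS = {"name", "type", "model", "api_key_env", "top_k"}

def validate_reranker_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks usable."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - entry.keys())]
    if "top_k" in entry and not (isinstance(entry["top_k"], int) and entry["top_k"] > 0):
        problems.append("top_k must be a positive integer")
    return problems

entry = {
    "name": "your-reranker",
    "type": "your-type",
    "model": "model-name",
    "api_key_env": "YOUR_API_KEY",
    "top_k": 15,
}
print(validate_reranker_entry(entry))  # → []
```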

2. Set API Key

export YOUR_API_KEY="your-key"

3. Run Reranking

cd eval-pipeline

for dataset in msmarco arguana fiqa_small business-reports pg dbpedia scifact; do
  python add-reranker.py --dataset $dataset --reranker-name "your-reranker" --skip-evaluate
done
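
Conceptually, each `add-reranker.py` run rescores the already-retrieved top-50 candidates and keeps the best `top_k`. A minimal sketch of that step, with an illustrative `score()` standing in for the real model or API call (the function names here are not the pipeline's actual interface):

```python
# Minimal sketch of the reranking step: rescore retrieved candidates with
# the new model and keep the top_k best. `score` stands in for a real
# cross-encoder or API call; this interface is illustrative only.
def rerank(query: str, candidates: list[str], score, top_k: int = 15) -> list[str]:
    """Sort candidates by reranker score, best first, and keep top_k."""
    return sorted(candidates, key=lambda doc: score(query, doc), reverse=True)[:top_k]

# Toy scorer: count of matching characters at the same position.
toy_score = lambda q, d: sum(a == b for a, b in zip(q, d))
print(rerank("berlin", ["paris", "bern", "berlin wall"], toy_score, top_k=2))
# → ['berlin wall', 'bern']
```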

4. Compare Rerankers

for dataset in msmarco arguana fiqa_small business-reports pg dbpedia scifact; do
  python compare-rerankers.py --dataset $dataset
done
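
This step only generates judgments pairing the new reranker against each existing one; the existing-vs-existing judgments are reused. With the 11 existing rerankers and 50 queries per dataset noted below, the new-judgment budget per dataset works out as:

```python
from itertools import product

# Back-of-the-envelope judgment budget: the new reranker is compared against
# each existing reranker on every query, so new judgments per dataset equal
# (existing rerankers) x (queries). Names here are placeholders.
existing_rerankers = [f"reranker-{i}" for i in range(11)]  # 11 already benchmarked
queries = [f"q{i}" for i in range(50)]                      # 50 queries per dataset

new_judgments = [("your-reranker", other, q)
                 for other, q in product(existing_rerankers, queries)]
print(len(new_judgments))  # → 550
```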

5. Aggregate Results

python aggregate-all-results.py

Results are saved to benchmarks.json.
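
The shape of the aggregation can be sketched as follows. This is illustrative only: the per-dataset layout, metric name, and output fields are assumptions, and the real `aggregate-all-results.py` may differ:

```python
import json
from statistics import mean

# Illustrative aggregation: average a per-dataset score into one entry per
# reranker, producing a benchmarks.json-like list. The real script's layout
# and metrics are not specified here; these numbers are made up.
per_dataset = {
    "msmarco": {"your-reranker": 0.62, "baseline": 0.55},
    "scifact": {"your-reranker": 0.70, "baseline": 0.66},
}

rerankers = {name for scores in per_dataset.values() for name in scores}
benchmarks = [
    {"name": name,
     "mean_score": round(mean(scores[name] for scores in per_dataset.values()), 4)}
    for name in sorted(rerankers)
]
print(json.dumps(benchmarks, indent=2))
```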

Verification

# Check reranked files
ls runs/*/*/rerank/reranked_your-reranker.jsonl

# Check ELO ratings
cat runs/msmarco/*/llm_judge/elo_leaderboard.csv

# Check final results
jq '.[] | select(.name == "your-reranker")' benchmarks.json

What Gets Computed

Reused:

  • Embeddings (BAAI/bge-small-en-v1.5)
  • Retrieval results (top-50 per query)
  • Existing pairwise judgments

New:

  • Your reranker's results
  • Pairwise judgments: your-reranker vs each existing reranker

Recalculated:

  • ELO ratings from all judgments
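
The recalculation can be sketched with the standard Elo update, R' = R + K·(S − E) where E = 1 / (1 + 10^((R_opp − R)/400)). The pipeline's actual K-factor and starting rating are not specified here, so the values below are assumptions:

```python
# Sketch of an ELO recalculation over all pairwise judgments, using the
# standard update R' = R + K * (S - E). The K-factor and starting rating
# used by the pipeline are assumptions here.
def recompute_elo(judgments, k: float = 32.0, start: float = 1000.0) -> dict:
    """judgments: iterable of (winner, loser) pairs; returns final ratings."""
    ratings: dict = {}
    for winner, loser in judgments:
        rw = ratings.setdefault(winner, start)
        rl = ratings.setdefault(loser, start)
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        ratings[winner] = rw + k * (1.0 - expected_w)  # winner gains
        ratings[loser] = rl - k * (1.0 - expected_w)   # loser pays the same amount
    return ratings

print(recompute_elo([("a", "b"), ("a", "b"), ("b", "c")]))
```

Because each judgment transfers the same amount between winner and loser, the total rating mass is conserved, which makes the leaderboard easy to sanity-check.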

Notes

  • 7 datasets: msmarco, arguana, fiqa_small, business-reports, pg, dbpedia, scifact
  • LLM judge: Azure OpenAI GPT-5
  • 50 queries per dataset
  • ~550 new judgments per dataset (11 existing rerankers × 50 queries)