# MITE: Massive Interaction Task Evaluation

## The Interaction Gap in Text Embedding Benchmarks

MTEB (Massive Text Embedding Benchmark) is the de facto standard for evaluating text embeddings, but all eight of its task types measure similarity; none measure interaction (complementarity, asymmetric matching, conditional compatibility).

MITE reframes existing MTEB datasets to evaluate interaction tasks: given two texts, does their combination produce a meaningful outcome beyond topical similarity?

## The Core Finding

MTEB rankings fail to predict interaction task performance. We take the same datasets MTEB uses and evaluate the same models, but ask an interaction question instead of a similarity question. The rankings diverge.

| Dataset | MTEB Question (Similarity) | MITE Question (Interaction) |
|---|---|---|
| SICK-R | How similar are these sentences? (1-5) | Does sentence A entail sentence B? (asymmetric) |
| FEVER | Retrieve relevant docs for this claim | Does the evidence support or refute the claim? |
| FiQA | Find relevant financial answers | How well does this answer resolve the question? |
| SummEval | How similar is the summary to the source? | How well does this summary capture key information? |
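The similarity/interaction distinction can be made concrete with a toy sketch. The vectors and the `containment` scoring function below are purely illustrative assumptions, not MITE's actual implementation: the point is that cosine similarity is symmetric by construction, while an entailment-style interaction score need not be.

```python
import math

def cosine(a, b):
    # Symmetric similarity: cosine(a, b) == cosine(b, a), always.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def containment(a, b):
    # Asymmetric "interaction" score (illustrative only): projection of
    # a onto b, normalized by b's squared norm. In general
    # containment(a, b) != containment(b, a), so direction matters.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / sum(y * y for y in b)

premise = [2.0, 2.0, 0.0]     # hypothetical embedding of a specific sentence
hypothesis = [1.0, 0.0, 0.0]  # hypothetical embedding of a more general one

# Cosine gives one number regardless of direction; containment gives two.
print(cosine(premise, hypothesis) == cosine(hypothesis, premise))
print(containment(premise, hypothesis), containment(hypothesis, premise))
```

A similarity benchmark only ever sees the single symmetric number; an entailment task like SICK-R asks which direction holds, which is exactly what the symmetric score cannot express.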

## Installation

```shell
pip install -e .

# For API model support (OpenAI, Voyage, Cohere):
pip install -e ".[api]"
```

## Quick Start

```python
from mite.tasks import SICKREntailmentTask
from mite.models import SentenceTransformerModel

# Load a task
task = SICKREntailmentTask()
task.load_data()

# Evaluate a model
model = SentenceTransformerModel("all-MiniLM-L6-v2")
results = task.evaluate(model)
print(results)
```

## Run Full Benchmark

```shell
# Run MITE evaluation on all tasks
python scripts/run_mite.py --models all-MiniLM-L6-v2 BAAI/bge-small-en-v1.5

# Compare with MTEB rankings
python scripts/compare_rankings.py --results-dir results/
```

## Tasks

| Task | Dataset | Interaction Type | Metrics |
|---|---|---|---|
| `SICKREntailmentTask` | SICK-R | Textual entailment (asymmetric) | Macro F1, directional accuracy |
| `FEVERInteractionTask` | FEVER | Claim verification (support vs. refute) | Separation score, AUROC |
| `ClimateFEVERInteractionTask` | ClimateFEVER | Claim verification | Separation score, AUROC |
| `SciFactInteractionTask` | SciFact | Scientific claim verification | Separation score, AUROC |
| `FiQAInteractionTask` | FiQA-2018 | Answer quality ranking | Spearman correlation |
| `CQADupstackInteractionTask` | CQADupstack | Answer acceptance prediction | AUROC, accuracy |
| `SummEvalInteractionTask` | SummEval | Summary quality assessment | Spearman correlation |
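As a sketch of how the verification-style metrics behave, the snippet below computes AUROC over toy claim-evidence scores. The numbers are hypothetical and the function is a generic rank-based AUROC, not the repo's implementation: it illustrates why an embedding that captures only topical similarity lands near chance (0.5) when asked to separate supporting from refuting evidence.

```python
def auroc(pos_scores, neg_scores):
    # Probability that a randomly chosen positive (supporting) pair
    # scores higher than a randomly chosen negative (refuting) pair;
    # ties count as half a win.
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical claim-evidence cosine scores: refuting evidence is just as
# on-topic as supporting evidence, so a similarity-only embedding scores
# both highly and the two classes barely separate.
support = [0.82, 0.79, 0.88]
refute = [0.80, 0.85, 0.77]
print(auroc(support, refute))
```

An AUROC of 1.0 would mean every supporting pair outscores every refuting pair; the near-0.5 result above is the failure mode MITE is designed to surface.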

## Paper

See `paper/mite_paper.tex` for the full paper: "The Interaction Gap: Why Text Embedding Benchmarks Miss What Matters".

## License

Apache 2.0
