Feature Request: Search Model Human Alignment Evaluation
Background
Jia Qi brought up an amazing paper today: [MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures](https://proceedings.neurips.cc/paper_files/paper/2024/file/b1f34d7b4a03a3d80be8e72eb430dd81-Paper-Conference.pdf)
Gist of the paper: it compares the distribution of a set of human Google searches against the distributions of different benchmark datasets.
Proposed Feature
Spin-off idea: measure how closely a search model's query distribution matches these human Google searches, so as to quantify how well our model aligns with human search preferences.
We can then rank other models on this same dataset and make a qualitative evaluation of which model is the most "human-aligned" at search.
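Once each model has an alignment score, the leaderboard step is just a sort. A minimal sketch (the model names and scores below are placeholders for illustration, not real measurements):

```python
def rank_by_alignment(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted from most to least human-aligned."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores purely for illustration; real values would come from
# comparing each model's query distribution against the human search data.
leaderboard = rank_by_alignment({"model_a": 0.72, "model_b": 0.64, "model_c": 0.81})
```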
Implementation
- Adapt MixEval's methodology from LLM evaluation to search model alignment
- Use distributional similarity between human search patterns and search model outputs
- Create leaderboard ranking different search models on human alignment
- Build evaluation framework that's efficient and reproducible
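One cheap, reproducible way to realize the "distributional similarity" bullet is to compare token distributions of human queries vs. model-generated queries with Jensen-Shannon divergence. This is only a sketch under simplifying assumptions (unigram tokens, whitespace splitting; a real framework would likely use embeddings or topic clusters), and all function names are ours, not from MixEval:

```python
import math
from collections import Counter

def token_distribution(queries: list[str]) -> dict[str, float]:
    """Lowercased unigram frequency distribution over a list of search queries."""
    counts = Counter(tok for q in queries for tok in q.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence (base 2), bounded in [0, 1]."""
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a: dict[str, float]) -> float:
        # KL(a || m); terms with a(t) == 0 contribute nothing.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def alignment_score(human_queries: list[str], model_queries: list[str]) -> float:
    """Higher = model's query distribution is closer to human searches (1 - JSD)."""
    return 1.0 - js_divergence(token_distribution(human_queries),
                               token_distribution(model_queries))
```

Identical query sets score 1.0 and fully disjoint vocabularies score 0.0, which gives the leaderboard a bounded, comparable metric across models.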
Some open questions
- What dataset to use, and where to collect the human Google search data? (Yuuki has been driving some collation of internal search data by sending his own search queries in #jan-model-internal / "Random Question for Model from User" on Discord; search for "Random question from user" on Discord to find the thread.)
- Which dataset should be used to curate the distributions of the different search models? (At a high level this is open to discussion with the team.)
Expected Outcome
Standardized benchmark for measuring how well search models align with human search behavior and preferences.
A picture of how the evaluation could look is shown in this excalidraw.
