feat: visualising search traces #2

@DESU-CLUB

Description

Feature Request: Search Model Human Alignment Evaluation

Background

Jia Qi brought up an amazing paper today: [MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures](https://proceedings.neurips.cc/paper_files/paper/2024/file/b1f34d7b4a03a3d80be8e72eb430dd81-Paper-Conference.pdf)

Gist of the paper: it compares the distribution of a set of human Google searches with the distributions of different benchmark datasets.

Proposed Feature

Spin-off idea: instead, measure how close the distribution of a search model's queries is to the distribution of these human Google searches, as a proxy for how well our model aligns with human search preferences.

We can then rank other models on the same dataset and make a qualitative evaluation of which model is the most "human-aligned" in how it searches.

Implementation

  • Adapt MixEval's methodology from LLM evaluation to search model alignment
  • Use distributional similarity between human search patterns and search model outputs
  • Create leaderboard ranking different search models on human alignment
  • Build evaluation framework that's efficient and reproducible
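
To make the distributional-similarity and leaderboard bullets concrete, here is a minimal sketch. It assumes every query (human or model-issued) has already been assigned a topic category, and it uses Jensen-Shannon divergence as the similarity measure. Both choices are assumptions for illustration and not necessarily what MixEval does; the category names, model names, and toy data are hypothetical.

```python
import math
from collections import Counter


def distribution(labels, categories):
    """Category frequencies with add-one smoothing so no bin is empty."""
    counts = Counter(labels)
    total = len(labels) + len(categories)
    return [(counts[c] + 1) / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 means identical, at most 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_models(human_labels, model_labels, categories):
    """Return (model, divergence) pairs, most human-aligned first."""
    human = distribution(human_labels, categories)
    scores = {
        name: js_divergence(human, distribution(labels, categories))
        for name, labels in model_labels.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical toy data: topic labels for human and model-issued queries.
CATEGORIES = ["how-to", "news", "shopping", "reference"]
human_queries = ["how-to"] * 5 + ["news"] * 3 + ["shopping"] + ["reference"]
model_queries = {
    "model-a": ["how-to"] * 4 + ["news"] * 3 + ["shopping"] * 2 + ["reference"],
    "model-b": ["reference"] * 8 + ["news"] * 2,
}

leaderboard = rank_models(human_queries, model_queries, CATEGORIES)
for rank, (name, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: JS divergence = {score:.3f}")
```

A lower divergence means the model's query mix is closer to the human one. In practice the category labels could be replaced by clusters of query embeddings, which is left open here.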

Some open questions

  • Which dataset to use, or where to collect the human Google search data. (Yuuki has been driving some collation of internal search data by sending his own search queries in #jan-model-internal/Random Question for Model from User on Discord; search for "Random question from user" on Discord to find the thread.)
  • Which dataset should be used to build the query distributions of the different search models. (At a high level, how to do this is open to discussion with the team.)

Expected Outcome

Standardized benchmark for measuring how well search models align with human search behavior and preferences.

A sketch of what the evaluation could look like can be seen in this Excalidraw.

Image

Metadata

Assignees

Labels

No labels

Projects

Status

Eng Planning

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
