
epic: Evaluating Search Models #3

@Yip-Jia-Qi

Description


Project

  1. Search Eval without tool calling (@new5558)
  2. Visualising Tool Calling Traces (@DESU-CLUB)
  3. MixEval for Search (KIV)

MixEval for Search

The current MixEval is built on the LMArena human-preference leaderboard, and its main selling point is that it aligns well with human preferences on LMArena. However, a limitation of MixEval is that it does not account for newer tool-calling web-search models such as Jan-v1. LMArena runs another benchmark, SearchArena, which could be good source material for a web-search-based MixEval that spans all the major search-based evals: Jan Exam v1, SimpleQA, FreshQA, and SealQA.

Approach

  • Replicate MixEval's embedding strategy, which matches real user queries to benchmark questions by embedding similarity
  • Mine a dataset of human search queries from SearchArena
  • Plot the resulting benchmark mix across all the source datasets, as in MixEval (a rough sketch of the pipeline follows this list)
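
As a rough sketch of how these three steps could fit together, the snippet below embeds a few toy SearchArena-style queries and pooled benchmark questions with sentence-transformers, keeps the nearest benchmark match per query, and plots the resulting per-eval mix. The model name all-MiniLM-L6-v2, the 0.5 similarity cutoff, and the toy data are illustrative assumptions, not details taken from MixEval or from this plan.

```python
# Sketch: match mined SearchArena queries to pooled benchmark questions
# via sentence embeddings, then count the per-eval mix.
# The toy data, model choice, and cutoff are hypothetical placeholders.
from collections import Counter

import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer

# Hypothetical pooled questions from the four source evals.
benchmark_pool = [
    {"question": "Who won the 2024 Nobel Prize in Physics?", "source": "FreshQA"},
    {"question": "What year was the Eiffel Tower completed?", "source": "SimpleQA"},
    {"question": "Which paper introduced the transformer architecture?", "source": "SealQA"},
    {"question": "What is the latest stable release of Python?", "source": "Jan Exam v1"},
]
# Hypothetical queries mined from SearchArena conversations.
search_queries = [
    "nobel prize physics 2024 winner",
    "when was eiffel tower built",
    "current python version",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
bench_emb = model.encode(
    [b["question"] for b in benchmark_pool], normalize_embeddings=True
)
query_emb = model.encode(search_queries, normalize_embeddings=True)

# With normalized embeddings, cosine similarity is a plain dot product.
sims = query_emb @ bench_emb.T
best = sims.argmax(axis=1)

# Keep the nearest benchmark question per mined query, mirroring
# MixEval's retrieval-based construction; 0.5 is an arbitrary cutoff.
selected = [benchmark_pool[j] for i, j in enumerate(best) if sims[i, j] >= 0.5]

# Per-eval counts are the raw material for a MixEval-style
# distribution plot across the source datasets.
counts = Counter(item["source"] for item in selected)
plt.bar(list(counts.keys()), list(counts.values()))
plt.ylabel("matched queries")
plt.title("SearchArena query mix across source evals")
plt.show()
```

In practice the pool would be built from the full Jan Exam v1, SimpleQA, FreshQA, and SealQA question sets, and the queries from a mined SearchArena dump; the embedding model and similarity cutoff are tuning choices.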

Metadata

Status: Eng Planning
