Project
- Search Eval without tool calling (@new5558)
- Visualising Tool Calling Traces (@DESU-CLUB)
- MixEval for Search (KIV)
MixEval for Search
The current MixEval is based on the LMArena human preference leaderboard.
Its main selling point is that its scores align well with human preferences on LMArena.
However, a limitation of MixEval is that it does not account for newer tool-calling web-search models such as Jan-v1.
LMArena has another benchmark called SearchArena.
This could be good source material for a web-search-based MixEval that spans all the major search-based evals: Jan Exam v1, SimpleQA, FreshQA, and SealQA.
Approach
- Replicate the embedding strategy of MixEval
- Mine a dataset of human search queries from SearchArena
- Prepare a plot across all the datasets just like in MixEval
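
A minimal sketch of the matching step in the approach above, assuming the embedding strategy boils down to "embed each mined user query, then pick the most similar question from the pooled benchmarks". The `embed` function here is a toy hashing stand-in for a real sentence encoder, and `match_queries` and the example data are hypothetical names for illustration, not MixEval's actual code.

```python
import zlib

import numpy as np


def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy bag-of-words hashing embedding (assumption: a real run would
    swap this for a proper sentence-embedding model)."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def match_queries(queries, pool):
    """Match each user query to its nearest benchmark question.

    queries: list of user query strings (e.g. mined from SearchArena)
    pool:    list of (benchmark_name, question) tuples
    Returns one (benchmark_name, question, cosine_similarity) per query.
    """
    pool_vecs = np.stack([embed(question) for _, question in pool])
    matches = []
    for query in queries:
        # Vectors are unit-normalised, so the dot product is cosine similarity
        sims = pool_vecs @ embed(query)
        best = int(np.argmax(sims))
        name, question = pool[best]
        matches.append((name, question, float(sims[best])))
    return matches
```

Counting how many queries land on each benchmark (SimpleQA, FreshQA, and so on) would then give the mixture weights, and the same embeddings could feed the cross-dataset plot.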