
epic: Evaluating Search Models #3

@Yip-Jia-Qi

Description


Project

  1. Search Eval without tool calling (@new5558)
  2. Visualising Tool Calling Traces (@DESU-CLUB)
  3. MixEval for Search (KIV)

MixEval for Search

The current MixEval is built on the LMArena human-preference leaderboard, and its main selling point is that it aligns well with human preferences on LMArena. However, a limitation of MixEval is that it does not account for newer tool-calling web-search models such as Jan-v1. LMArena runs another benchmark, SearchArena, which could be good source material for a web-search-based MixEval that spans all the major search-based evals: Jan Exam v1, SimpleQA, FreshQA, and SealQA.

Approach

  • Replicate MixEval's embedding strategy, which matches real user queries to benchmark questions by embedding similarity
  • Mine a dataset of human search queries from SearchArena
  • Plot the resulting benchmark mix across all the source datasets, as in MixEval (a rough sketch of the pipeline follows this list)
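
As a rough sketch of how these three steps could fit together, the snippet below embeds a few toy SearchArena-style queries and pooled benchmark questions with sentence-transformers, keeps the nearest benchmark match per query, and plots the resulting per-eval mix. The model name all-MiniLM-L6-v2, the 0.5 similarity cutoff, and the toy data are illustrative assumptions, not details taken from MixEval or from this plan.

```python
# Sketch: match mined SearchArena queries to pooled benchmark questions
# via sentence embeddings, then count the per-eval mix.
# The toy data, model choice, and cutoff are hypothetical placeholders.
from collections import Counter

import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer

# Hypothetical pooled questions from the four source evals.
benchmark_pool = [
    {"question": "Who won the 2024 Nobel Prize in Physics?", "source": "FreshQA"},
    {"question": "What year was the Eiffel Tower completed?", "source": "SimpleQA"},
    {"question": "Which paper introduced the transformer architecture?", "source": "SealQA"},
    {"question": "What is the latest stable release of Python?", "source": "Jan Exam v1"},
]
# Hypothetical queries mined from SearchArena conversations.
search_queries = [
    "nobel prize physics 2024 winner",
    "when was eiffel tower built",
    "current python version",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
bench_emb = model.encode(
    [b["question"] for b in benchmark_pool], normalize_embeddings=True
)
query_emb = model.encode(search_queries, normalize_embeddings=True)

# With normalized embeddings, cosine similarity is a plain dot product.
sims = query_emb @ bench_emb.T
best = sims.argmax(axis=1)

# Keep the nearest benchmark question per mined query, mirroring
# MixEval's retrieval-based construction; 0.5 is an arbitrary cutoff.
selected = [benchmark_pool[j] for i, j in enumerate(best) if sims[i, j] >= 0.5]

# Per-eval counts are the raw material for a MixEval-style
# distribution plot across the source datasets.
counts = Counter(item["source"] for item in selected)
plt.bar(list(counts.keys()), list(counts.values()))
plt.ylabel("matched queries")
plt.title("SearchArena query mix across source evals")
plt.show()
```

In practice the pool would be built from the full Jan Exam v1, SimpleQA, FreshQA, and SealQA question sets, and the queries from a mined SearchArena dump; the embedding model and similarity cutoff are tuning choices.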

Metadata

Status: Eng Planning
