Problem Statement
Agentic search benchmark coverage is incomplete for current frontier comparisons. BrowseComp is frequently reported by model providers but not yet integrated as an lmms-eval task.
Proposed Solution
Add BrowseComp benchmark integration under `lmms_eval/tasks/browsecomp/`, including task config(s), a tool-use or constrained-interaction adapter as required by the benchmark format, and aggregation metrics.
Acceptance Criteria
- `lmms_eval/tasks/browsecomp/` exists with runnable YAML + `utils.py`.
- BrowseComp task names appear in `python -m lmms_eval --tasks list`.
- At least one smoke run (`--limit 5`) works with configured backend.
- `docs/current_tasks.md` documents BrowseComp support.
Implementation Plan
- Map BrowseComp task format to lmms-eval output type.
- Implement evaluation and scoring utilities.
- Add task YAML and defaults for generation args.
- Add docs entry and example invocation.
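As a rough illustration of the "evaluation and scoring utilities" step, a `utils.py` could look something like the sketch below. Everything here is an assumption made for illustration: the function names (`browsecomp_process_results`, `browsecomp_aggregate_accuracy`), the metric key `browsecomp_acc`, and the `answer` field on each document are hypothetical, and normalized exact match is only a placeholder scoring rule, not necessarily how the benchmark is officially graded.

```python
import re
import string
from typing import Dict, List


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def browsecomp_process_results(doc: Dict[str, str], results: List[str]) -> Dict[str, float]:
    """Score one sample (hypothetical hook name and doc schema).

    Assumes `doc["answer"]` holds the reference answer and `results[0]`
    holds the model's final answer string.
    """
    prediction = results[0] if results else ""
    correct = normalize_answer(prediction) == normalize_answer(doc["answer"])
    return {"browsecomp_acc": 1.0 if correct else 0.0}


def browsecomp_aggregate_accuracy(scores: List[float]) -> float:
    """Mean of per-sample scores; returns 0.0 for an empty list."""
    return sum(scores) / len(scores) if scores else 0.0
```

If the benchmark ends up needing a model-based grader, only `browsecomp_process_results` would need to change; the aggregation step could stay as a plain mean.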
Technical Notes
- Align with existing `generate_until_agentic` and agentic trace pathways if benchmark requires tool steps.
- Keep deterministic mode available for CI-friendly smoke tests.
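One possible way to keep smoke tests deterministic is to swap live search for an offline stub that always returns the same results for the same query. The sketch below is purely illustrative: `CANNED_PAGES` and `stub_search` are hypothetical names, and how such a stub would plug into the agentic pathway is left open.

```python
from typing import Dict, List

# Hypothetical canned corpus standing in for live web search during CI.
CANNED_PAGES: Dict[str, str] = {
    "capital of france": "Paris is the capital of France.",
    "eiffel tower height": "The Eiffel Tower is about 330 metres tall.",
}


def stub_search(query: str, top_k: int = 1) -> List[str]:
    """Return canned snippets ranked by word overlap with the query.

    Ties are broken by key order, so repeated calls give identical output.
    """
    query_words = set(query.lower().split())
    scored = []
    for key in sorted(CANNED_PAGES):
        overlap = len(query_words & set(key.split()))
        if overlap:
            # Negative overlap so that an ascending sort puts best matches first.
            scored.append((-overlap, key))
    scored.sort()
    return [CANNED_PAGES[key] for _, key in scored[:top_k]]
```

Because the stub touches no network and uses no randomness, a `--limit 5` run against it should produce the same tool outputs on every CI invocation.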