
[P1][Benchmark] BrowseComp agentic search benchmark integration tracking #1113

@Luodian

Problem Statement

Agentic search benchmark coverage in lmms-eval is incomplete for current frontier-model comparisons. Model providers frequently report BrowseComp results, but BrowseComp is not yet integrated as an lmms-eval task.

Proposed Solution

Add BrowseComp benchmark integration under lmms_eval/tasks/browsecomp/, including task config(s), a tool-use or constrained-interaction adapter as required by the benchmark format, and aggregation metrics.

Acceptance Criteria

  • lmms_eval/tasks/browsecomp/ exists with runnable YAML + utils.py.
  • BrowseComp task names appear in python -m lmms_eval --tasks list.
  • At least one smoke run (--limit 5) completes successfully with the configured backend.
  • docs/current_tasks.md documents BrowseComp support.
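
The first criterion above could be checked mechanically before a smoke run. The following is a minimal sketch, not part of lmms-eval: the helper name `missing_task_files` is hypothetical, and it assumes the task config uses a `.yaml` extension and sits directly in the task directory.

```python
from pathlib import Path


def missing_task_files(task_dir):
    """Return a list of problems found in a task directory (empty if none)."""
    root = Path(task_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    # Assumption: at least one *.yaml task config lives next to utils.py.
    if not any(root.glob("*.yaml")):
        problems.append("no *.yaml task config found")
    if not (root / "utils.py").is_file():
        problems.append("utils.py not found")
    return problems


if __name__ == "__main__":
    # Path taken from the acceptance criteria in this issue.
    for problem in missing_task_files("lmms_eval/tasks/browsecomp"):
        print(problem)
```

An empty result only shows that the files exist; it does not replace the task listing check or the smoke run.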

Implementation Plan

  1. Map the BrowseComp task format to an lmms-eval output type.
  2. Implement evaluation and scoring utilities.
  3. Add task YAML and defaults for generation args.
  4. Add docs entry and example invocation.
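
As a rough illustration of step 2, the scoring utilities in utils.py might look like the sketch below. It swaps in normalized exact match as a simple stand-in; BrowseComp's reference grading may differ (for example, model-based grading), so this is not the benchmark's protocol. The function names, the `answer` field, and the `browsecomp_accuracy` metric key are all assumptions, and the `(doc, results)` signature mirrors what existing task utilities appear to use rather than a confirmed contract.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def browsecomp_process_results(doc, results):
    """Score one sample. `doc["answer"]` is an assumed field name."""
    prediction = results[0] if results else ""
    correct = normalize_answer(prediction) == normalize_answer(doc["answer"])
    # The metric key is hypothetical; it would need to match the task YAML.
    return {"browsecomp_accuracy": 1.0 if correct else 0.0}


def browsecomp_aggregate_accuracy(scores):
    """Mean of per-sample scores; 0.0 for an empty list."""
    return sum(scores) / len(scores) if scores else 0.0
```

Exact match is cheap and deterministic, which suits smoke tests, but it will undercount answers that are correct yet phrased differently.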

Technical Notes

  • Align with the existing generate_until_agentic and agentic trace pathways if the benchmark requires tool steps.
  • Keep a deterministic mode available for CI-friendly smoke tests.
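
One possible reading of the deterministic mode is a search tool that replays recorded results instead of querying the network. The sketch below illustrates that idea only; `ReplaySearchTool` and its `search` method are hypothetical names and do not correspond to any existing lmms-eval interface.

```python
class ReplaySearchTool:
    """Hypothetical search tool that replays canned results (no network access)."""

    def __init__(self, fixtures):
        # fixtures maps a query string to the list of result snippets to return.
        self._fixtures = dict(fixtures)
        self.calls = []  # recorded queries, useful for asserting on the trace

    def search(self, query):
        self.calls.append(query)
        if query not in self._fixtures:
            # Fail loudly so a smoke test never silently falls back to live search.
            raise KeyError(f"no recorded result for query: {query!r}")
        # Return a copy so callers cannot mutate the fixtures.
        return list(self._fixtures[query])
```

Combined with greedy decoding, a replayed tool would make a small smoke run repeatable across CI jobs, at the cost of not exercising real retrieval.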

Labels

enhancement: New feature or request
