[P1][Benchmark] BrowseComp agentic search benchmark integration tracking #1113

## Problem Statement

lmms-eval's coverage of agentic search benchmarks is incomplete for current frontier-model comparisons. `BrowseComp` is frequently reported by model providers but is not yet integrated as an lmms-eval task.

## Proposed Solution

Add a BrowseComp benchmark integration under `lmms_eval/tasks/browsecomp/`, including task config(s), a tool-use or constrained-interaction adapter as required by the benchmark format, and aggregation metrics.

## Acceptance Criteria

* `lmms_eval/tasks/browsecomp/` exists and contains a runnable task YAML and `utils.py`.
* BrowseComp task names appear in `python -m lmms_eval --tasks list`.
* At least one smoke run (`--limit 5`) completes successfully with a configured backend.
* `docs/current_tasks.md` documents BrowseComp support.
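
The first and third criteria could be checked mechanically before review. The following is a minimal sketch, not part of the repository: `missing_task_files` and `smoke_command` are hypothetical helper names, and the task name and model name passed in are placeholders to be replaced with whatever the integration actually registers.

```python
import sys
from pathlib import Path


def missing_task_files(task_dir: Path) -> list:
    """Return the required items that are absent from the task directory."""
    missing = []
    # At least one YAML task config is expected (exact file name is not fixed here).
    if not any(task_dir.glob("*.yaml")):
        missing.append("*.yaml")
    if not (task_dir / "utils.py").is_file():
        missing.append("utils.py")
    return missing


def smoke_command(task: str, model: str, limit: int = 5) -> list:
    """Build the argv for a small smoke run; task and model names are placeholders."""
    return [
        sys.executable, "-m", "lmms_eval",
        "--model", model,
        "--tasks", task,
        "--limit", str(limit),
    ]
```

The returned argv can be passed to `subprocess.run(..., check=True)` in a CI job once a backend is configured.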

## Implementation Plan

1. Map the BrowseComp task format to an lmms-eval output type.
2. Implement evaluation and scoring utilities.
3. Add the task YAML with default generation arguments.
4. Add a docs entry and an example invocation.
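
As a starting point for step 2, the scoring utilities in `utils.py` might look roughly like the sketch below. It assumes the usual lmms-eval pattern of a per-sample `process_results(doc, results)` that returns a metric dict plus an aggregation function over the collected values. The function names, the `"answer"` field, and the `browsecomp_acc` metric key are all placeholders, and normalized exact match is used here only as a simple stand-in for whichever grading procedure the benchmark specifies.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def browsecomp_process_results(doc: dict, results: list) -> dict:
    """Score one sample. `doc["answer"]` is an assumed field name."""
    prediction = results[0] if results else ""
    correct = normalize_answer(prediction) == normalize_answer(doc["answer"])
    return {"browsecomp_acc": float(correct)}


def browsecomp_aggregate_accuracy(scores: list) -> float:
    """Mean of per-sample scores; returns 0.0 for an empty run."""
    return sum(scores) / len(scores) if scores else 0.0
```

Exact match will undercount correct free-form answers, so it is best treated as a smoke-test metric until the benchmark's own grading is wired in.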

## Technical Notes

* Align with the existing `generate_until_agentic` and agentic trace pathways if the benchmark requires tool steps.
* Keep a deterministic mode available for CI-friendly smoke tests.
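
One way to keep smoke tests deterministic is to replace live web search with an offline stub and to pin greedy decoding. The sketch below is illustrative only: `StubSearchTool` and `DETERMINISTIC_GEN_KWARGS` are hypothetical names, the tool interface is not taken from the existing agentic pathway, and the generation keys are assumed to match what the chosen backend accepts.

```python
from dataclasses import dataclass, field


@dataclass
class StubSearchTool:
    """Offline stand-in for a search tool; serves results from a fixed corpus."""
    corpus: dict = field(default_factory=dict)   # title -> body text
    calls: list = field(default_factory=list)    # queries seen, for trace checks

    def search(self, query: str, top_k: int = 3) -> list:
        self.calls.append(query)
        terms = set(query.lower().split())
        scored = []
        for title, body in self.corpus.items():
            words = set((title + " " + body).lower().split())
            overlap = len(terms & words)
            if overlap:
                # Negative overlap so that sorting puts the best match first;
                # the title breaks ties, which keeps the ordering stable.
                scored.append((-overlap, title, body))
        scored.sort()
        return [{"title": t, "snippet": b} for _, t, b in scored[:top_k]]


# Assumed greedy-decoding defaults for CI; key names depend on the backend.
DETERMINISTIC_GEN_KWARGS = {"temperature": 0.0, "do_sample": False, "max_new_tokens": 256}
```

Because the stub never touches the network and sorts ties by title, repeated runs over the same corpus return identical results, which makes trace comparisons in CI straightforward.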
