Conversation

@lkk12014402
Collaborator

Description

Add a benchmark script for evaluating the accuracy of the Deep Research Agent.

Copilot AI review requested due to automatic review settings July 11, 2025 09:05
@github-actions

github-actions bot commented Jul 11, 2025

Dependency Review

✅ No vulnerabilities or license issues found.

Scanned Files

  • DeepResearchAgent/requirements.txt

Contributor

Copilot AI left a comment


Pull Request Overview

Adds a benchmarking harness for measuring the accuracy of the Deep Research Agent by integrating dataset loaders, an LLM-based scoring function, and CLI support.

  • Wraps the agent’s raw output in a JSON field ("answer") in research_agent.py.
  • Introduces eval.py to load questions, invoke the agent, score answers via an LLM judge, and save results.
  • Updates documentation under benchmark/accuracy and the main README to cover setup and usage.
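The load/invoke/judge/save flow described above can be sketched as follows. This is a minimal illustration, not the PR's actual eval.py: the dataset schema (one JSON object per line with "question" and "answer" keys) is assumed, and the exact-match `judge` is a stand-in for the LLM judge the PR uses.

```python
import json

def load_questions(dataset_path):
    # Assumed schema: one JSON object per line with "question" and "answer" keys.
    with open(dataset_path) as f:
        return [json.loads(line) for line in f if line.strip()]

def judge(question, reference, candidate):
    # Stand-in for the PR's LLM judge; a naive exact-match comparison.
    return candidate.strip().lower() == reference.strip().lower()

def run_benchmark(questions, ask_agent):
    # ask_agent is any callable mapping a question string to an answer string.
    results = []
    for q in questions:
        predicted = ask_agent(q["question"])
        results.append({
            "question": q["question"],
            "expected": q["answer"],
            "predicted": predicted,
            "correct": judge(q["question"], q["answer"], predicted),
        })
    accuracy = sum(r["correct"] for r in results) / len(results) if results else 0.0
    return results, accuracy
```

In the real script, `ask_agent` would POST the question to the running agent service and unwrap the `{"answer": ...}` payload, and `judge` would call the LLM scorer.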

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Reviewed files:

  • DeepResearchAgent/research_agent.py — return payload now wrapped as {"answer": ...}
  • DeepResearchAgent/benchmark/accuracy/eval.py — new benchmark script for accuracy evaluation
  • DeepResearchAgent/benchmark/accuracy/README.md — instructions for running the accuracy benchmark
  • DeepResearchAgent/README.md — added note on configuring deep_researcher.yaml in setup
Comments suppressed due to low confidence (2)

DeepResearchAgent/benchmark/accuracy/eval.py:93

  • [nitpick] There are no unit tests covering key functions like load_questions, process_single_question, or run_benchmark. Adding tests will help ensure correctness when loading datasets and computing accuracy.
def load_questions(dataset_names: list[str] | None = None) -> list[dict[str, str]]:
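Tests of the kind the comment suggests might look like the following. The `load_questions` body here is a stand-in that only mirrors the reviewed signature; the datasets and return values are invented for illustration.

```python
def load_questions(dataset_names=None):
    # Stand-in mirroring the reviewed signature; the real function
    # loads named benchmark datasets from disk or the Hub.
    data = {
        "toy": [{"question": "2+2?", "answer": "4"}],
        "geo": [{"question": "Capital of France?", "answer": "Paris"}],
    }
    names = dataset_names or list(data)
    return [q for name in names for q in data[name]]

def test_loads_all_datasets_by_default():
    assert len(load_questions()) == 2

def test_filters_by_dataset_name():
    rows = load_questions(["toy"])
    assert rows == [{"question": "2+2?", "answer": "4"}]

def test_rows_have_expected_keys():
    assert all({"question", "answer"} <= set(r) for r in load_questions())
```

Analogous tests for `process_single_question` and `run_benchmark` could stub the agent call and assert on the computed accuracy.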

DeepResearchAgent/benchmark/accuracy/eval.py:334

  • The args.agent_config field is referenced in metadata but no --agent-config argument is defined in the parser. Consider adding a corresponding parser argument or renaming this field to match --service-url or another existing flag.
            "agent_config": args.agent_config,

@joshuayao joshuayao added this to the v1.4 milestone Jul 14, 2025
Collaborator

@minmin-intel minmin-intel left a comment


Can you add some benchmark results in README? And add some recommendations about what LLMs to use with vllm-gaudi?

@lkk12014402
Collaborator Author

> Can you add some benchmark results in README? And add some recommendations about what LLMs to use with vllm-gaudi?

No problem, I will.

@lkk12014402
Collaborator Author

> Can you add some benchmark results in README? And add some recommendations about what LLMs to use with vllm-gaudi?

Added togethercomputer/together-search-bench benchmark accuracy data.

@joshuayao joshuayao added this to OPEA Aug 13, 2025
@joshuayao joshuayao moved this to In review in OPEA Aug 13, 2025
@joshuayao
Collaborator

Hi @lkk12014402 could you please help check the CI failures? Thanks.

@joshuayao joshuayao moved this from In review to In progress in OPEA Aug 14, 2025
@chensuyue chensuyue merged commit fe1527f into main Aug 20, 2025
19 checks passed
@chensuyue chensuyue deleted the deep_research_agent_benchmark branch August 20, 2025 03:37
@github-project-automation github-project-automation bot moved this from In progress to Done in OPEA Aug 20, 2025

Labels: none

Project: OPEA — Status: Done

8 participants