[BENCHMARK BUG] Gaia2: Gemini‑2.5‑Pro scores far below paper #25

@tshu-w

Description

📊 Benchmark Bug Description:

  • Benchmark: Gaia2
  • Issue: The run completes and produces benchmark_stats.json, but all category scores are far below the Gemini‑2.5‑Pro results reported in the paper.
  • Request: Please share the authoritative evaluation commands and parameters (including all defaults), the judge configuration, seeds and number of repeats, and a small reproducible reference run (with traces), so we can align the evaluation protocol and identify the cause of the discrepancy.

🎯 Benchmark Bug Category:

  • 📈 Performance Issue (slow execution, memory usage, timeouts)
  • 📊 Metrics Issue (incorrect calculations, missing metrics)
  • 🔄 Reproducibility Issue (inconsistent results across runs)
  • 💾 Data Issue (incorrect datasets, missing data, data corruption)
  • ⚖️ Evaluation Issue (scoring problems, comparison errors)
  • 🏃 Execution Issue (benchmark fails to run, crashes during evaluation)

📋 Benchmark Details

Benchmark Information:

  • Benchmark Name: Gaia2
  • Benchmark Version/Commit: 78ea3bdbdeec2bdcd6afa5420915d8a22f23ed99
  • Dataset Used: meta-agents-research-environments/gaia2
  • Dataset Split: validation
  • Dataset Config: all

Model Configuration:

  • Model: Gemini‑2.5‑Pro
  • Provider: local
  • Agent: default

Evaluation Parameters:

  • Number of runs: 3
  • Scenario timeout: 900 seconds
  • Max concurrent scenarios: auto-detected
  • Limit: no limit
  • Oracle mode: disabled
  • Noise augmentation: disabled
  • Agent2Agent proportion: 0.0

Judge Configuration (if applicable):

  • Judge model: meta-llama/Meta-Llama-3.3-70B-Instruct (default)
  • Judge provider: local

🔄 Steps to Reproduce

mkdir -p data
hf download meta-agents-research-environments/gaia2 --repo-type=dataset --local-dir ./data/gaia2
hf download meta-agents-research-environments/gaia2_filesystem --repo-type=dataset --local-dir ./data/gaia2_filesystem

export OPENAI_API_KEY="sk-XXXX" OPENAI_BASE_URL="https://xxx.xxx.xxx/v1"
export USE_LITELLM_PROXY=True
export DEMO_FS_PATH=./data/gaia2_filesystem/demo_filesystem

are-benchmark run \
  --hf ./data/gaia2 \
  --hf-config search \
  --provider local \
  -a default \
  --model gemini-2.5-pro \
  --judge_provider local \
  --judge_model gemini-2.5-pro
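
The download commands write under `./data/`, which both the `--hf ./data/gaia2` argument and `DEMO_FS_PATH` expect; a small sanity check like the sketch below (the helper and the assumed directory layout are illustrative, not part of the ARE tooling) catches a missing or mistyped dataset path before committing to a 900-second-per-scenario run:

```python
from pathlib import Path

def check_dataset_layout(root: str) -> list[str]:
    """Return the expected Gaia2 data paths (layout assumed from the
    commands above) that are missing under `root`."""
    expected = ["gaia2", "gaia2_filesystem/demo_filesystem"]
    return [p for p in expected if not (Path(root) / p).is_dir()]

# Prints any paths still missing under ./data; an empty list means the
# downloads landed where the run command expects them.
print(check_dataset_layout("./data"))
```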

Frequency:

  • Always happens
  • Happens most of the time (>75%)
  • Happens sometimes (25-75%)
  • Rarely happens (<25%)
  • Only under specific conditions (describe below)

📈 Metrics & Results

Expected Metrics:

  Execution:     39.2 ± 2.2
  Search:        57.7 ± 2.3
  Ambiguity:     18.1 ± 1.8
  Adaptability:  17.5 ± 1.7
  Time:           7.3 ± 1.2
  Noise:         20.4 ± 1.8
  Agent2Agent:   20.4 ± 1.8

Actual Metrics:

  Execution:      9.8 ± 0.2
  Search:        28.3 ± 1.3
  Ambiguity:      0.0 ± 0.0
  Adaptability:   0.0 ± 0.0
  Time:           2.3 ± 0.4
  Noise:          5.0 ± 0.6
  Agent2Agent:    6.0 ± 0.6
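
For triage, a quick sketch comparing the two tables (values copied from above, ± std-errors omitted; the capability names are just labels, not the confirmed `benchmark_stats.json` schema):

```python
# Paper-reported Gemini-2.5-Pro means vs. this run's means.
paper = {"execution": 39.2, "search": 57.7, "ambiguity": 18.1,
         "adaptability": 17.5, "time": 7.3, "noise": 20.4, "agent2agent": 20.4}
actual = {"execution": 9.8, "search": 28.3, "ambiguity": 0.0,
          "adaptability": 0.0, "time": 2.3, "noise": 5.0, "agent2agent": 6.0}

for cap in paper:
    ratio = actual[cap] / paper[cap]
    print(f"{cap:12s} paper={paper[cap]:5.1f} "
          f"actual={actual[cap]:5.1f} ratio={ratio:.2f}")
```

The roughly 2-4x gap across every capability, plus exact zeros on Ambiguity and Adaptability, looks more like a protocol mismatch (judge model, dataset config, or unscored splits) than run-to-run variance, which is why reference commands and traces are requested above.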

🔧 System & Environment Information

Software Environment:

  • OS: Ubuntu 22.04
  • Architecture: x64

🐍 Python Environment Details

Python Information:

  • Python version: 3.13.9
  • Python implementation: CPython
  • Virtual environment: venv

Meta Agents Research Environments Installation:

  • Package version: 1.1.0
  • Installation method: pip install meta-agents-research-environments
  • Installation location: site-packages

Key Dependencies:

huggingface-hub                   0.33.4
litellm                           1.71.1
meta-agents-research-environments 1.1.0
numpy                             2.2.6
pandas                            2.2.3

Full Environment (if dependency-related):

aiobotocore==2.25.1
aiohappyeyeballs==2.6.1
aiohttp==3.13.2
aioitertools==0.12.0
aiosignal==1.4.0
annotated-types==0.7.0
anyio==4.11.0
asttokens==3.0.0
attrs==25.4.0
beautifulsoup4==4.14.2
botocore==1.40.61
certifi==2025.10.5
cffi==2.0.0
charset-normalizer==3.4.4
click==8.1.8
cobble==0.1.4
cryptography==46.0.3
datasets==4.0.0
decorator==5.2.1
deepdiff==8.6.1
dill==0.3.8
distro==1.9.0
docstring_parser==0.16
executing==2.2.1
fastapi==0.116.1
filelock==3.20.0
frozenlist==1.8.0
fsspec==2024.12.0
geographiclib==2.0
graphql-core==3.2.7
h11==0.16.0
hf-xet==1.2.0
httpcore==1.0.9
httpx==0.28.1
httpx-aiohttp==0.1.9
httpx-sse==0.4.3
huggingface-hub==0.33.4
idna==3.11
importlib_metadata==8.7.0
inputimeout==1.0.4
ipython==9.7.0
ipython_pygments_lexers==1.1.1
jedi==0.19.2
Jinja2==3.1.6
jiter==0.11.1
jmespath==1.0.1
joblib==1.4.2
jsonschema==4.25.1
jsonschema-specifications==2025.9.1
litellm==1.71.1
lxml==6.0.2
mammoth==1.8.0
markdown-it-py==4.0.0
markdownify==0.14.1
MarkupSafe==3.0.3
matplotlib-inline==0.2.1
mcp==1.11.0
mdurl==0.1.2
meta-agents-research-environments==1.1.0
multidict==6.7.0
multiprocess==0.70.16
mysql-connector-python==9.2.0
numpy==2.2.6
openai==2.7.1
orderly-set==5.5.0
packaging==25.0
pandas==2.2.3
parso==0.8.5
pathvalidate==3.2.1
pdfminer.six==20231228
pexpect==4.9.0
phonenumbers==8.13.53
pillow==11.1.0
polars-lts-cpu==1.33.1
prompt_toolkit==3.0.52
propcache==0.4.1
ptyprocess==0.7.0
pure_eval==0.2.3
puremagic==1.27
py==1.11.0
pyarrow==22.0.0
pycparser==2.23
pydantic==2.10.6
pydantic-settings==2.11.0
pydantic_core==2.27.2
Pygments==2.19.2
pypdf==6.0.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
python-pptx==1.0.2
pytz==2025.2
PyYAML==6.0.3
RapidFuzz==3.12.1
referencing==0.37.0
regex==2025.11.3
requests==2.32.5
retry==0.9.2
rich==14.2.0
rpds-py==0.28.0
s3fs==2024.12.0
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
soupsieve==2.8
sse-starlette==3.0.3
stack-data==0.6.3
starlette==0.47.2
strawberry-graphql==0.275.5
termcolor==2.5.0
tiktoken==0.12.0
tokenizers==0.22.1
tqdm==4.67.1
tqdm_joblib==0.0.4
traitlets==5.14.3
typer==0.20.0
typing-inspection==0.4.2
typing_extensions==4.15.0
tzdata==2025.2
urllib3==2.5.0
uvicorn==0.35.0
wcwidth==0.2.14
wrapt==1.17.3
wsproto==1.2.0
xlsxwriter==3.2.9
xxhash==3.6.0
yarl==1.22.0
zipp==3.23.0

📊 Impact Assessment

Severity:

  • 🔴 Critical (benchmark completely unusable)
  • 🟠 High (significantly impacts evaluation reliability)
  • 🟡 Medium (minor impact on benchmark results)
  • 🟢 Low (cosmetic or edge case)

Research Impact:

  • Blocks research progress
  • Affects result reliability
  • Impacts performance comparisons
  • Minor inconvenience

Reproducibility Impact:

  • Makes results non-reproducible
  • Causes inconsistent results
  • Minor variation in results
  • No impact on reproducibility

🔧 Workaround

Available Workaround:

  • Yes (describe below)
  • No

Workaround Description:
None known. Official commands, parameters, judge configuration, and reference traces are needed to align the evaluation protocol.
