[BENCHMARK BUG] Gaia2: Gemini‑2.5‑Pro scores far below paper #25

@tshu-w

Description

📊 Benchmark Bug Description:

  • Benchmark: Gaia2
  • Issue: The run completes and produces benchmark_stats.json, but all category scores are far below the Gemini‑2.5‑Pro results reported in the paper.
  • Request: Please share the authoritative evaluation commands and parameters (including all defaults), the judge configuration, seeds and number of repeats, and a small reproducible reference run (with traces), so we can align the evaluation protocol and identify the cause of the discrepancy.

🎯 Benchmark Bug Category:

  • 📈 Performance Issue (slow execution, memory usage, timeouts)
  • 📊 Metrics Issue (incorrect calculations, missing metrics)
  • 🔄 Reproducibility Issue (inconsistent results across runs)
  • 💾 Data Issue (incorrect datasets, missing data, data corruption)
  • ⚖️ Evaluation Issue (scoring problems, comparison errors)
  • 🏃 Execution Issue (benchmark fails to run, crashes during evaluation)

📋 Benchmark Details

Benchmark Information:

  • Benchmark Name: Gaia2
  • Benchmark Version/Commit: 78ea3bdbdeec2bdcd6afa5420915d8a22f23ed99
  • Dataset Used: meta-agents-research-environments/gaia2
  • Dataset Split: validation
  • Dataset Config: all

Model Configuration:

  • Model: Gemini‑2.5‑Pro
  • Provider: local
  • Agent: default

Evaluation Parameters:

  • Number of runs: 3
  • Scenario timeout: 900 seconds
  • Max concurrent scenarios: auto-detected
  • Limit: no limit
  • Oracle mode: disabled
  • Noise augmentation: disabled
  • Agent2Agent proportion: 0.0

Judge Configuration (if applicable):

  • Judge model: meta-llama/Meta-Llama-3.3-70B-Instruct (default)
  • Judge provider: local

🔄 Steps to Reproduce

mkdir -p data
hf download meta-agents-research-environments/gaia2 --repo-type=dataset --local-dir ./data/gaia2
hf download meta-agents-research-environments/gaia2_filesystem --repo-type=dataset --local-dir ./data/gaia2_filesystem

export OPENAI_API_KEY="sk-XXXX" OPENAI_BASE_URL="https://xxx.xxx.xxx/v1"
export USE_LITELLM_PROXY=True
export DEMO_FS_PATH=./data/gaia2_filesystem/demo_filesystem

are-benchmark run \
  --hf ./data/gaia2 \
  --hf-config search \
  --provider local \
  -a default \
  --model gemini-2.5-pro \
  --judge_provider local \
  --judge_model gemini-2.5-pro
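
The download commands write under `./data/`, which both the `--hf ./data/gaia2` argument and `DEMO_FS_PATH` expect; a small sanity check like the sketch below (the helper and the assumed directory layout are illustrative, not part of the ARE tooling) catches a missing or mistyped dataset path before committing to a 900-second-per-scenario run:

```python
from pathlib import Path

def check_dataset_layout(root: str) -> list[str]:
    """Return the expected Gaia2 data paths (layout assumed from the
    commands above) that are missing under `root`."""
    expected = ["gaia2", "gaia2_filesystem/demo_filesystem"]
    return [p for p in expected if not (Path(root) / p).is_dir()]

# Prints any paths still missing under ./data; an empty list means the
# downloads landed where the run command expects them.
print(check_dataset_layout("./data"))
```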

Frequency:

  • Always happens
  • Happens most of the time (>75%)
  • Happens sometimes (25-75%)
  • Rarely happens (<25%)
  • Only under specific conditions (describe below)

📈 Metrics & Results

Expected Metrics:

  Execution:     39.2 ± 2.2
  Search:        57.7 ± 2.3
  Ambiguity:     18.1 ± 1.8
  Adaptability:  17.5 ± 1.7
  Time:           7.3 ± 1.2
  Noise:         20.4 ± 1.8
  Agent2Agent:   20.4 ± 1.8

Actual Metrics:

  Execution:      9.8 ± 0.2
  Search:        28.3 ± 1.3
  Ambiguity:      0.0 ± 0.0
  Adaptability:   0.0 ± 0.0
  Time:           2.3 ± 0.4
  Noise:          5.0 ± 0.6
  Agent2Agent:    6.0 ± 0.6
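
For triage, a quick sketch comparing the two tables (values copied from above, ± std-errors omitted; the capability names are just labels, not the confirmed `benchmark_stats.json` schema):

```python
# Paper-reported Gemini-2.5-Pro means vs. this run's means.
paper = {"execution": 39.2, "search": 57.7, "ambiguity": 18.1,
         "adaptability": 17.5, "time": 7.3, "noise": 20.4, "agent2agent": 20.4}
actual = {"execution": 9.8, "search": 28.3, "ambiguity": 0.0,
          "adaptability": 0.0, "time": 2.3, "noise": 5.0, "agent2agent": 6.0}

for cap in paper:
    ratio = actual[cap] / paper[cap]
    print(f"{cap:12s} paper={paper[cap]:5.1f} "
          f"actual={actual[cap]:5.1f} ratio={ratio:.2f}")
```

The roughly 2-4x gap across every capability, plus exact zeros on Ambiguity and Adaptability, looks more like a protocol mismatch (judge model, dataset config, or unscored splits) than run-to-run variance, which is why reference commands and traces are requested above.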

🔧 System & Environment Information

Software Environment:

  • OS: Ubuntu 22.04
  • Architecture: x64

🐍 Python Environment Details

Python Information:

  • Python version: 3.13.9
  • Python implementation: CPython
  • Virtual environment: venv

Meta Agents Research Environments Installation:

  • Package version: 1.1.0
  • Installation method: pip install meta-agents-research-environments
  • Installation location: site-packages

Key Dependencies:

huggingface-hub                   0.33.4
litellm                           1.71.1
meta-agents-research-environments 1.1.0
numpy                             2.2.6
pandas                            2.2.3

Full Environment (if dependency-related):

aiobotocore==2.25.1
aiohappyeyeballs==2.6.1
aiohttp==3.13.2
aioitertools==0.12.0
aiosignal==1.4.0
annotated-types==0.7.0
anyio==4.11.0
asttokens==3.0.0
attrs==25.4.0
beautifulsoup4==4.14.2
botocore==1.40.61
certifi==2025.10.5
cffi==2.0.0
charset-normalizer==3.4.4
click==8.1.8
cobble==0.1.4
cryptography==46.0.3
datasets==4.0.0
decorator==5.2.1
deepdiff==8.6.1
dill==0.3.8
distro==1.9.0
docstring_parser==0.16
executing==2.2.1
fastapi==0.116.1
filelock==3.20.0
frozenlist==1.8.0
fsspec==2024.12.0
geographiclib==2.0
graphql-core==3.2.7
h11==0.16.0
hf-xet==1.2.0
httpcore==1.0.9
httpx==0.28.1
httpx-aiohttp==0.1.9
httpx-sse==0.4.3
huggingface-hub==0.33.4
idna==3.11
importlib_metadata==8.7.0
inputimeout==1.0.4
ipython==9.7.0
ipython_pygments_lexers==1.1.1
jedi==0.19.2
Jinja2==3.1.6
jiter==0.11.1
jmespath==1.0.1
joblib==1.4.2
jsonschema==4.25.1
jsonschema-specifications==2025.9.1
litellm==1.71.1
lxml==6.0.2
mammoth==1.8.0
markdown-it-py==4.0.0
markdownify==0.14.1
MarkupSafe==3.0.3
matplotlib-inline==0.2.1
mcp==1.11.0
mdurl==0.1.2
meta-agents-research-environments==1.1.0
multidict==6.7.0
multiprocess==0.70.16
mysql-connector-python==9.2.0
numpy==2.2.6
openai==2.7.1
orderly-set==5.5.0
packaging==25.0
pandas==2.2.3
parso==0.8.5
pathvalidate==3.2.1
pdfminer.six==20231228
pexpect==4.9.0
phonenumbers==8.13.53
pillow==11.1.0
polars-lts-cpu==1.33.1
prompt_toolkit==3.0.52
propcache==0.4.1
ptyprocess==0.7.0
pure_eval==0.2.3
puremagic==1.27
py==1.11.0
pyarrow==22.0.0
pycparser==2.23
pydantic==2.10.6
pydantic-settings==2.11.0
pydantic_core==2.27.2
Pygments==2.19.2
pypdf==6.0.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
python-pptx==1.0.2
pytz==2025.2
PyYAML==6.0.3
RapidFuzz==3.12.1
referencing==0.37.0
regex==2025.11.3
requests==2.32.5
retry==0.9.2
rich==14.2.0
rpds-py==0.28.0
s3fs==2024.12.0
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
soupsieve==2.8
sse-starlette==3.0.3
stack-data==0.6.3
starlette==0.47.2
strawberry-graphql==0.275.5
termcolor==2.5.0
tiktoken==0.12.0
tokenizers==0.22.1
tqdm==4.67.1
tqdm_joblib==0.0.4
traitlets==5.14.3
typer==0.20.0
typing-inspection==0.4.2
typing_extensions==4.15.0
tzdata==2025.2
urllib3==2.5.0
uvicorn==0.35.0
wcwidth==0.2.14
wrapt==1.17.3
wsproto==1.2.0
xlsxwriter==3.2.9
xxhash==3.6.0
yarl==1.22.0
zipp==3.23.0

📊 Impact Assessment

Severity:

  • 🔴 Critical (benchmark completely unusable)
  • 🟠 High (significantly impacts evaluation reliability)
  • 🟡 Medium (minor impact on benchmark results)
  • 🟢 Low (cosmetic or edge case)

Research Impact:

  • Blocks research progress
  • Affects result reliability
  • Impacts performance comparisons
  • Minor inconvenience

Reproducibility Impact:

  • Makes results non-reproducible
  • Causes inconsistent results
  • Minor variation in results
  • No impact on reproducibility

🔧 Workaround

Available Workaround:

  • Yes (describe below)
  • No

Workaround Description:
None known. Official commands, parameters, judge configuration, and reference traces are needed to align the evaluation protocol.
