📊 Benchmark Bug Description:
- Benchmark: Gaia2
- Issue: The run completes and produces benchmark_stats.json, but all category scores are far below the Gemini-2.5-Pro results reported in the paper.
- Request: Please provide the authoritative evaluation commands, parameters (including all defaults), judge configuration, seeds/repeats, and a small reproducible reference run with traces, so we can align the evaluation protocol and identify the cause of the discrepancy (see the sketch below for the kind of run we mean).
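For concreteness, the kind of small reference run we are asking for might look like the following sketch. The `--limit` flag name is an assumption based on the "Limit" field under Evaluation Parameters below; we have not verified it against the `are-benchmark` CLI help, so please correct it as needed.

```bash
# Hypothetical: a tiny reference run whose traces could be shared as the
# canonical comparison point. The --limit flag name is assumed, not verified.
are-benchmark run \
  --hf ./data/gaia2 \
  --hf-config search \
  --provider local \
  -a default \
  --model gemini-2.5-pro \
  --limit 5
```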
🎯 Benchmark Bug Category:
- 📈 Performance Issue (slow execution, memory usage, timeouts)
- 📊 Metrics Issue (incorrect calculations, missing metrics)
- 🔄 Reproducibility Issue (inconsistent results across runs)
- 💾 Data Issue (incorrect datasets, missing data, data corruption)
- ⚖️ Evaluation Issue (scoring problems, comparison errors)
- 🏃 Execution Issue (benchmark fails to run, crashes during evaluation)
📋 Benchmark Details
Benchmark Information:
- Benchmark Name: Gaia2
- Benchmark Version/Commit: 78ea3bdbdeec2bdcd6afa5420915d8a22f23ed99
- Dataset Used: meta-agents-research-environments/gaia2
- Dataset Split: validation
- Dataset Config: all

Model Configuration:
- Model: Gemini-2.5-Pro
- Provider: local
- Agent: default

Evaluation Parameters:
- Number of runs: 3
- Scenario timeout: 900 seconds
- Max concurrent scenarios: auto-detected
- Limit: no limit
- Oracle mode: disabled
- Noise augmentation: disabled
- Agent2Agent proportion: 0.0

Judge Configuration (if applicable):
- Judge model: meta-llama/Meta-Llama-3.3-70B-Instruct (default)
- Judge provider: local
```bash
are-benchmark run \
  --hf ./data/gaia2 \
  --hf-config search \
  --provider local \
  -a default \
  --model gemini-2.5-pro \
  --judge_provider local \
  --judge_model gemini-2.5-pro
```
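Note that the command above overrides the documented default judge (meta-llama/Meta-Llama-3.3-70B-Instruct, per the Judge Configuration section) with gemini-2.5-pro. If the paper's numbers were produced with the default judge, that alone could shift scores. A variant worth trying, reusing only flags already shown above:

```bash
# Same run, but with the default judge model from the Judge Configuration
# section; whether the judge choice explains the gap is an assumption to verify.
are-benchmark run \
  --hf ./data/gaia2 \
  --hf-config search \
  --provider local \
  -a default \
  --model gemini-2.5-pro \
  --judge_provider local \
  --judge_model meta-llama/Meta-Llama-3.3-70B-Instruct
```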
🔄 Steps to Reproduce
```bash
mkdir data
hf download meta-agents-research-environments/gaia2 --repo-type=dataset --local-dir ./data/gaia2
hf download meta-agents-research-environments/gaia2_filesystem --repo-type=dataset --local-dir ./data/gaia2_filesystem
export OPENAI_API_KEY="sk-XXXX" OPENAI_BASE_URL="https://xxx.xxx.xxx/v1"
export USE_LITELLM_PROXY=True
export DEMO_FS_PATH=./data/gaia2_filesystem/demo_filesystem
```
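Before launching the run, a couple of quick sanity checks (paths taken from the steps above) can rule out a silently missing dataset or filesystem snapshot, which we have not excluded as a cause:

```bash
# Verify the dataset and filesystem snapshot landed where the run command
# and DEMO_FS_PATH expect them, and that the endpoint is actually set.
ls ./data/gaia2 | head
ls ./data/gaia2_filesystem/demo_filesystem | head
echo "base url: $OPENAI_BASE_URL"
```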
```bash
are-benchmark run \
  --hf ./data/gaia2 \
  --hf-config search \
  --provider local \
  -a default \
  --model gemini-2.5-pro \
  --judge_provider local \
  --judge_model gemini-2.5-pro
```
Frequency:
- Always happens
- Happens most of the time (>75%)
- Happens sometimes (25-75%)
- Rarely happens (<25%)
- Only under specific conditions (describe below)
📈 Metrics & Results
Expected (paper, Gemini-2.5-Pro) vs. actual metrics:

| Capability | Expected | Actual |
| --- | --- | --- |
| Execution | 39.2 ± 2.2 | 9.8 ± 0.2 |
| Search | 57.7 ± 2.3 | 28.3 ± 1.3 |
| Ambiguity | 18.1 ± 1.8 | 0.0 ± 0.0 |
| Adaptability | 17.5 ± 1.7 | 0.0 ± 0.0 |
| Time | 7.3 ± 1.2 | 2.3 ± 0.4 |
| Noise | 20.4 ± 1.8 | 5.0 ± 0.6 |
| Agent2Agent | 20.4 ± 1.8 | 6.0 ± 0.6 |
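For cross-checking, a minimal way to pull the per-capability numbers out of the produced stats file. The schema of benchmark_stats.json is not documented in this report, so the key name below is a placeholder to adapt after inspecting the file:

```bash
# The schema is assumed, not known: list the top-level keys first,
# then extract whatever per-capability section exists (hypothetical key).
jq 'keys' benchmark_stats.json
jq '.per_capability_scores' benchmark_stats.json
```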
🔧 System & Environment Information
Software Environment:
- OS: Ubuntu 22.04
- Architecture: x64
🐍 Python Environment Details
Python Information:
- Python version: 3.13.9
- Python implementation: CPython
- Virtual environment: venv
Meta Agents Research Environments Installation:
- Package version: 1.1.0
- Installation method: pip install meta-agents-research-environments
- Installation location: site-packages
Key Dependencies:
huggingface-hub 0.33.4
litellm 1.71.1
meta-agents-research-environments 1.1.0
numpy 2.2.6
pandas 2.2.3
Full Environment (if dependency-related):
aiobotocore==2.25.1
aiohappyeyeballs==2.6.1
aiohttp==3.13.2
aioitertools==0.12.0
aiosignal==1.4.0
annotated-types==0.7.0
anyio==4.11.0
asttokens==3.0.0
attrs==25.4.0
beautifulsoup4==4.14.2
botocore==1.40.61
certifi==2025.10.5
cffi==2.0.0
charset-normalizer==3.4.4
click==8.1.8
cobble==0.1.4
cryptography==46.0.3
datasets==4.0.0
decorator==5.2.1
deepdiff==8.6.1
dill==0.3.8
distro==1.9.0
docstring_parser==0.16
executing==2.2.1
fastapi==0.116.1
filelock==3.20.0
frozenlist==1.8.0
fsspec==2024.12.0
geographiclib==2.0
graphql-core==3.2.7
h11==0.16.0
hf-xet==1.2.0
httpcore==1.0.9
httpx==0.28.1
httpx-aiohttp==0.1.9
httpx-sse==0.4.3
huggingface-hub==0.33.4
idna==3.11
importlib_metadata==8.7.0
inputimeout==1.0.4
ipython==9.7.0
ipython_pygments_lexers==1.1.1
jedi==0.19.2
Jinja2==3.1.6
jiter==0.11.1
jmespath==1.0.1
joblib==1.4.2
jsonschema==4.25.1
jsonschema-specifications==2025.9.1
litellm==1.71.1
lxml==6.0.2
mammoth==1.8.0
markdown-it-py==4.0.0
markdownify==0.14.1
MarkupSafe==3.0.3
matplotlib-inline==0.2.1
mcp==1.11.0
mdurl==0.1.2
meta-agents-research-environments==1.1.0
multidict==6.7.0
multiprocess==0.70.16
mysql-connector-python==9.2.0
numpy==2.2.6
openai==2.7.1
orderly-set==5.5.0
packaging==25.0
pandas==2.2.3
parso==0.8.5
pathvalidate==3.2.1
pdfminer.six==20231228
pexpect==4.9.0
phonenumbers==8.13.53
pillow==11.1.0
polars-lts-cpu==1.33.1
prompt_toolkit==3.0.52
propcache==0.4.1
ptyprocess==0.7.0
pure_eval==0.2.3
puremagic==1.27
py==1.11.0
pyarrow==22.0.0
pycparser==2.23
pydantic==2.10.6
pydantic-settings==2.11.0
pydantic_core==2.27.2
Pygments==2.19.2
pypdf==6.0.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
python-pptx==1.0.2
pytz==2025.2
PyYAML==6.0.3
RapidFuzz==3.12.1
referencing==0.37.0
regex==2025.11.3
requests==2.32.5
retry==0.9.2
rich==14.2.0
rpds-py==0.28.0
s3fs==2024.12.0
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
soupsieve==2.8
sse-starlette==3.0.3
stack-data==0.6.3
starlette==0.47.2
strawberry-graphql==0.275.5
termcolor==2.5.0
tiktoken==0.12.0
tokenizers==0.22.1
tqdm==4.67.1
tqdm_joblib==0.0.4
traitlets==5.14.3
typer==0.20.0
typing-inspection==0.4.2
typing_extensions==4.15.0
tzdata==2025.2
urllib3==2.5.0
uvicorn==0.35.0
wcwidth==0.2.14
wrapt==1.17.3
wsproto==1.2.0
xlsxwriter==3.2.9
xxhash==3.6.0
yarl==1.22.0
zipp==3.23.0
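The dependency snapshot above is in pip freeze format; to capture the same information inside the virtual environment:

```bash
# Record interpreter and dependency versions for the report.
python --version
pip freeze
```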
📊 Impact Assessment
Severity:
- 🔴 Critical (benchmark completely unusable)
- 🟠 High (significantly impacts evaluation reliability)
- 🟡 Medium (minor impact on benchmark results)
- 🟢 Low (cosmetic or edge case)
Research Impact:
- Blocks research progress
- Affects result reliability
- Impacts performance comparisons
- Minor inconvenience
Reproducibility Impact:
- Makes results non-reproducible
- Causes inconsistent results
- Minor variation in results
- No impact on reproducibility
🔧 Workaround
Available Workaround:
- Yes (describe below)
- No
Workaround Description:
None at present. We need the official commands, parameters, judge configuration, and reference traces to align the evaluation protocol.