[BENCHMARK BUG] Failed to load dataset #12

**📊 Benchmark Bug Description**:

- What benchmark or evaluation is affected?
**Gaia2**

- What unexpected behavior did you observe?
The run fails with "Failed to load dataset" for every config, even though the dataset has already been downloaded and its latest cached version is present in the local Hugging Face cache directory.

- What were you trying to measure or evaluate?
**Use the pre-downloaded meta-agents-research-environments/gaia2 to walk through the mock evaluation process.**


**🎯 Benchmark Bug Category**:

- [ ] 📈 Performance Issue (slow execution, memory usage, timeouts)
- [ ] 📊 Metrics Issue (incorrect calculations, missing metrics)
- [ ] 🔄 Reproducibility Issue (inconsistent results across runs)
- [ ] 💾 Data Issue (incorrect datasets, missing data, data corruption)
- [ ] ⚖️ Evaluation Issue (scoring problems, comparison errors)
- [x] 🏃 Execution Issue (benchmark fails to run, crashes during evaluation)

## 📋 Benchmark Details
**Benchmark Information**:
- Benchmark Name: Gaia2
- Benchmark Version/Commit: [e.g., commit hash, tag, branch]
- Dataset Used: meta-agents-research-environments/gaia2
- Dataset Split: validation
- Dataset Config: mini, execution, search, adaptability, time, ambiguity


**Command Used**:
```bash
# Check if the dataset exists
ls /home/martin/.cache/huggingface/datasets/meta-agents-research-environments___gaia2/ 
adaptability  ambiguity  demo  execution  mini  search  time

# run
are-benchmark gaia2-run --hf-dataset meta-agents-research-environments/gaia2  --hf-split validation  --model gpt-4o --provider mock
```
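The same cache check can be scripted. The following is a minimal sketch, not part of the `are` package: `list_cached_configs` is a hypothetical helper, and it assumes the cache folder name is the repo id with `/` replaced by `___`, which is what the `ls` path above suggests.

```python
from pathlib import Path


def list_cached_configs(cache_root: Path, repo_id: str) -> list[str]:
    """Return the config directory names cached for `repo_id`, sorted."""
    # Assumption: the cache folder replaces "/" in the repo id with "___",
    # as seen in the path used by the `ls` command above.
    dataset_dir = cache_root / repo_id.replace("/", "___")
    if not dataset_dir.is_dir():
        return []
    # Only directories are treated as configs; stray files are ignored.
    return sorted(p.name for p in dataset_dir.iterdir() if p.is_dir())
```

Calling `list_cached_configs(Path.home() / ".cache" / "huggingface" / "datasets", "meta-agents-research-environments/gaia2")` on the affected machine should list the same directories as the `ls` output.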

## 🔄 Steps to Reproduce
**Environment Setup**:
1. Dataset preparation: ...
2. Model/Agent configuration: ...
3. Environment setup: ...
4. Command execution: ...

**Reproduction Steps**:
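1. Download `meta-agents-research-environments/gaia2` so that its configs are present in the local Hugging Face datasets cache.
2. Run the `are-benchmark gaia2-run` command shown under **Command Used**.
3. Wait for each phase/config to try to load its dataset split.
4. Observe the issue: every phase/config fails with "Failed to load dataset" and no scenarios are processed.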

**Frequency**:
- [x] Always happens
- [ ] Happens most of the time (>75%)
- [ ] Happens sometimes (25-75%)
- [ ] Rarely happens (<25%)
- [ ] Only under specific conditions (describe below)


## 📋 Error Information
**Error Messages**:
```
2025-10-10 17:00:16,105 - MainThread - WARNING - datasets.load - Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 17:00:16,106 - MainThread - ERROR - are.simulation.benchmark.huggingface_loader - Failed to load dataset meta-agents-research-environments/gaia2/mini/validation: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'mini-ac06a8d2d8a73f66'
```

**Stack Trace** (if applicable):
```
Traceback (most recent call last):
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/are/simulation/benchmark/huggingface_loader.py", line 163, in _create_huggingface_scenario_iterator
    ds = load_dataset(
         ^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1392, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1166, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 124, in __init__
    config_name, version, hash = _find_hash_in_cache(
                                 ^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 64, in _find_hash_in_cache
    raise ValueError(
ValueError: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'mini-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
```
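The mismatch in the trace can be made explicit with a small parser: the requested config is `mini-ac06a8d2d8a73f66`, while the cache lists plain `mini`. This is a rough sketch using only the standard library. `diagnose_cache_miss` is a hypothetical helper, and treating the text after the last hyphen as a hash-like suffix is an assumption based on the message format above, not on the `datasets` source.

```python
import re


def diagnose_cache_miss(error_text: str) -> dict:
    """Extract the requested config and the cached configs from the error text."""
    requested = re.search(r"for config '([^']+)'", error_text)
    available = re.search(r"Available configs in the cache: \[([^\]]*)\]", error_text)
    if requested is None or available is None:
        raise ValueError("error text does not match the expected format")
    requested_name = requested.group(1)
    cached = re.findall(r"'([^']+)'", available.group(1))
    # Assumption: everything after the last hyphen is a hash-like suffix.
    base_name, _, suffix = requested_name.rpartition("-")
    return {
        "requested": requested_name,
        "cached": cached,
        "exact_match": requested_name in cached,
        "base_name_cached": bool(base_name) and base_name in cached,
        "suffix": suffix if base_name else "",
    }
```

For the error above, `exact_match` is false while `base_name_cached` is true, which suggests the data is on disk but the lookup is keyed on a suffixed name that the cache does not contain.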

**Log Output**:
```
are-benchmark gaia2-run --hf-dataset meta-agents-research-environments/gaia2  --hf-split validation  --model mock --provider mock -l 1
2025-10-10 16:56:40,741 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - No output directory specified. Using default: ./gaia2_results
2025-10-10 16:56:40,742 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - Starting GAIA2 submission pipeline...
2025-10-10 16:56:40,742 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - Standard phase will run configs: ambiguity, adaptability, execution, search, time
2025-10-10 16:56:40,742 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - Agent2Agent and Noise phases will run mini config only
2025-10-10 16:56:40,742 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - This includes: standard runs, agent2agent runs, and noise runs
2025-10-10 16:56:40,742 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - === Phase: Running standard configurations ===
2025-10-10 16:56:40,742 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - Running standard config: ambiguity
Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 16:57:11,478 - MainThread - WARNING - datasets.load - Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 16:57:11,479 - MainThread - ERROR - are.simulation.benchmark.huggingface_loader - Failed to load dataset meta-agents-research-environments/gaia2/ambiguity/validation: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'ambiguity-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
Traceback (most recent call last):
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/are/simulation/benchmark/huggingface_loader.py", line 163, in _create_huggingface_scenario_iterator
    ds = load_dataset(
         ^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1392, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1166, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 124, in __init__
    config_name, version, hash = _find_hash_in_cache(
                                 ^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 64, in _find_hash_in_cache
    raise ValueError(
ValueError: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'ambiguity-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
2025-10-10 16:57:11,497 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Running each scenario 3 times to improve variance
2025-10-10 16:57:11,498 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Starting.
2025-10-10 16:57:11,498 - MainThread - INFO - are.simulation.multi_scenario_runner - Running scenarios in parallel with 5 workers
Running Ambiguity scenarios: 0it [00:00, ?it/s, Success=0.0%]
2025-10-10 16:57:11,499 - MainThread - ERROR - are.simulation.benchmark.gaia2_submission - Phase 'standard' config 'ambiguity' failed with error: No scenarios processed
2025-10-10 16:57:11,499 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - Running standard config: adaptability
Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 16:57:42,248 - MainThread - WARNING - datasets.load - Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 16:57:42,249 - MainThread - ERROR - are.simulation.benchmark.huggingface_loader - Failed to load dataset meta-agents-research-environments/gaia2/adaptability/validation: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'adaptability-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
Traceback (most recent call last):
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/are/simulation/benchmark/huggingface_loader.py", line 163, in _create_huggingface_scenario_iterator
    ds = load_dataset(
         ^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1392, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1166, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 124, in __init__
    config_name, version, hash = _find_hash_in_cache(
                                 ^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 64, in _find_hash_in_cache
    raise ValueError(
ValueError: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'adaptability-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
2025-10-10 16:57:42,249 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Running each scenario 3 times to improve variance
2025-10-10 16:57:42,250 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Starting.
2025-10-10 16:57:42,250 - MainThread - INFO - are.simulation.multi_scenario_runner - Running scenarios in parallel with 5 workers
Running Adaptability scenarios: 0it [00:00, ?it/s, Success=0.0%]
2025-10-10 16:57:42,250 - MainThread - ERROR - are.simulation.benchmark.gaia2_submission - Phase 'standard' config 'adaptability' failed with error: No scenarios processed
2025-10-10 16:57:42,250 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - Running standard config: execution
Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 16:58:13,013 - MainThread - WARNING - datasets.load - Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 16:58:13,014 - MainThread - ERROR - are.simulation.benchmark.huggingface_loader - Failed to load dataset meta-agents-research-environments/gaia2/execution/validation: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'execution-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
Traceback (most recent call last):
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/are/simulation/benchmark/huggingface_loader.py", line 163, in _create_huggingface_scenario_iterator
    ds = load_dataset(
         ^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1392, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1166, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 124, in __init__
    config_name, version, hash = _find_hash_in_cache(
                                 ^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 64, in _find_hash_in_cache
    raise ValueError(
ValueError: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'execution-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
2025-10-10 16:58:13,014 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Running each scenario 3 times to improve variance
2025-10-10 16:58:13,014 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Starting.
2025-10-10 16:58:13,015 - MainThread - INFO - are.simulation.multi_scenario_runner - Running scenarios in parallel with 5 workers
Running Execution scenarios: 0it [00:00, ?it/s, Success=0.0%]
2025-10-10 16:58:13,015 - MainThread - ERROR - are.simulation.benchmark.gaia2_submission - Phase 'standard' config 'execution' failed with error: No scenarios processed
2025-10-10 16:58:13,015 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - Running standard config: search
Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 16:58:43,738 - MainThread - WARNING - datasets.load - Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 16:58:43,739 - MainThread - ERROR - are.simulation.benchmark.huggingface_loader - Failed to load dataset meta-agents-research-environments/gaia2/search/validation: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'search-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
Traceback (most recent call last):
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/are/simulation/benchmark/huggingface_loader.py", line 163, in _create_huggingface_scenario_iterator
    ds = load_dataset(
         ^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1392, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1166, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 124, in __init__
    config_name, version, hash = _find_hash_in_cache(
                                 ^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 64, in _find_hash_in_cache
    raise ValueError(
ValueError: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'search-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
2025-10-10 16:58:43,739 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Running each scenario 3 times to improve variance
2025-10-10 16:58:43,739 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Starting.
2025-10-10 16:58:43,740 - MainThread - INFO - are.simulation.multi_scenario_runner - Running scenarios in parallel with 5 workers
Running Search scenarios: 0it [00:00, ?it/s, Success=0.0%]
2025-10-10 16:58:43,740 - MainThread - ERROR - are.simulation.benchmark.gaia2_submission - Phase 'standard' config 'search' failed with error: No scenarios processed
2025-10-10 16:58:43,740 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - Running standard config: time
Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 16:59:14,563 - MainThread - WARNING - datasets.load - Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 16:59:14,564 - MainThread - ERROR - are.simulation.benchmark.huggingface_loader - Failed to load dataset meta-agents-research-environments/gaia2/time/validation: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'time-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
Traceback (most recent call last):
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/are/simulation/benchmark/huggingface_loader.py", line 163, in _create_huggingface_scenario_iterator
    ds = load_dataset(
         ^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1392, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1166, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 124, in __init__
    config_name, version, hash = _find_hash_in_cache(
                                 ^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 64, in _find_hash_in_cache
    raise ValueError(
ValueError: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'time-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
2025-10-10 16:59:14,564 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Running each scenario 3 times to improve variance
2025-10-10 16:59:14,564 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Starting.
2025-10-10 16:59:14,564 - MainThread - INFO - are.simulation.multi_scenario_runner - Running scenarios in parallel with 5 workers
Running Time scenarios: 0it [00:00, ?it/s, Success=0.0%]
2025-10-10 16:59:14,565 - MainThread - ERROR - are.simulation.benchmark.gaia2_submission - Phase 'standard' config 'time' failed with error: No scenarios processed
2025-10-10 16:59:14,565 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - === Phase: Running agent2agent configurations ===
2025-10-10 16:59:14,565 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - Running agent2agent config: mini
Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 16:59:45,336 - MainThread - WARNING - datasets.load - Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 16:59:45,337 - MainThread - ERROR - are.simulation.benchmark.huggingface_loader - Failed to load dataset meta-agents-research-environments/gaia2/mini/validation: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'mini-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
Traceback (most recent call last):
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/are/simulation/benchmark/huggingface_loader.py", line 163, in _create_huggingface_scenario_iterator
    ds = load_dataset(
         ^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1392, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1166, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 124, in __init__
    config_name, version, hash = _find_hash_in_cache(
                                 ^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 64, in _find_hash_in_cache
    raise ValueError(
ValueError: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'mini-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
2025-10-10 16:59:45,338 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Running each scenario 3 times to improve variance
2025-10-10 16:59:45,338 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Starting.
2025-10-10 16:59:45,338 - MainThread - INFO - are.simulation.multi_scenario_runner - Running scenarios in parallel with 5 workers
Running Mini scenarios with Agent2Agent: 0it [00:00, ?it/s, Success=0.0%]
2025-10-10 16:59:45,338 - MainThread - ERROR - are.simulation.benchmark.gaia2_submission - Phase 'agent2agent' config 'mini' failed with error: No scenarios processed
2025-10-10 16:59:45,338 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - === Phase: Running noise configurations ===
2025-10-10 16:59:45,339 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - Running noise config: mini
Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 17:00:16,105 - MainThread - WARNING - datasets.load - Using the latest cached version of the dataset since meta-agents-research-environments/gaia2 couldn't be found on the Hugging Face Hub
2025-10-10 17:00:16,106 - MainThread - ERROR - are.simulation.benchmark.huggingface_loader - Failed to load dataset meta-agents-research-environments/gaia2/mini/validation: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'mini-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
Traceback (most recent call last):
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/are/simulation/benchmark/huggingface_loader.py", line 163, in _create_huggingface_scenario_iterator
    ds = load_dataset(
         ^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1392, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/load.py", line 1166, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 124, in __init__
    config_name, version, hash = _find_hash_in_cache(
                                 ^^^^^^^^^^^^^^^^^^^^
  File "/data/repo/are-gaia2/.venv/lib/python3.12/site-packages/datasets/packaged_modules/cache/cache.py", line 64, in _find_hash_in_cache
    raise ValueError(
ValueError: Couldn't find cache for meta-agents-research-environments/gaia2 for config 'mini-ac06a8d2d8a73f66'
Available configs in the cache: ['adaptability', 'ambiguity', 'demo', 'execution', 'mini', 'search', 'time']
2025-10-10 17:00:16,106 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Running each scenario 3 times to improve variance
2025-10-10 17:00:16,106 - MainThread - INFO - are.simulation.benchmark.scenario_executor - Starting.
2025-10-10 17:00:16,106 - MainThread - INFO - are.simulation.multi_scenario_runner - Running scenarios in parallel with 5 workers
Running Mini scenarios with Noise: 0it [00:00, ?it/s, Success=0.0%]
2025-10-10 17:00:16,107 - MainThread - ERROR - are.simulation.benchmark.gaia2_submission - Phase 'noise' config 'mini' failed with error: No scenarios processed
2025-10-10 17:00:16,107 - MainThread - WARNING - are.simulation.benchmark.gaia2_submission - The following 7 phase/config(s) failed entirely and were skipped:
2025-10-10 17:00:16,107 - MainThread - WARNING - are.simulation.benchmark.gaia2_submission -   - standard/ambiguity
2025-10-10 17:00:16,107 - MainThread - WARNING - are.simulation.benchmark.gaia2_submission -   - standard/adaptability
2025-10-10 17:00:16,107 - MainThread - WARNING - are.simulation.benchmark.gaia2_submission -   - standard/execution
2025-10-10 17:00:16,107 - MainThread - WARNING - are.simulation.benchmark.gaia2_submission -   - standard/search
2025-10-10 17:00:16,107 - MainThread - WARNING - are.simulation.benchmark.gaia2_submission -   - standard/time
2025-10-10 17:00:16,107 - MainThread - WARNING - are.simulation.benchmark.gaia2_submission -   - agent2agent/mini
2025-10-10 17:00:16,107 - MainThread - WARNING - are.simulation.benchmark.gaia2_submission -   - noise/mini
2025-10-10 17:00:16,107 - MainThread - WARNING - are.simulation.benchmark.gaia2_submission - Check the logs above for specific error details.
2025-10-10 17:00:16,107 - MainThread - INFO - are.simulation.benchmark.gaia2_submission - GAIA2 submission summary:
2025-10-10 17:00:16,107 - MainThread - INFO - are.simulation.benchmark.gaia2_submission -   Total phase/configs attempted: 7
2025-10-10 17:00:16,107 - MainThread - INFO - are.simulation.benchmark.gaia2_submission -   Successful phase/configs: 0
2025-10-10 17:00:16,107 - MainThread - INFO - are.simulation.benchmark.gaia2_submission -   Failed phase/configs: 7
2025-10-10 17:00:16,107 - MainThread - INFO - are.simulation.benchmark.gaia2_submission -   Total scenarios processed: 0
2025-10-10 17:00:16,107 - MainThread - ERROR - are.simulation.benchmark.gaia2_submission - All phase/configs failed. No results generated.
2025-10-10 17:00:16,108 - MainThread - WARNING - are.simulation.benchmark.cli - No results to report.
2025-10-10 17:00:16,108 - MainThread - INFO - are.simulation.benchmark.cli - All Done.
```
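Because the log is long and repetitive, it can help to reduce it to the failing phase/config pairs. The snippet below is an illustrative sketch: `failed_phase_configs` is a hypothetical helper, and the regular expression assumes the "Phase '...' config '...' failed with error: ..." wording seen in the log above.

```python
import re

# Matches lines such as:
#   ... - Phase 'standard' config 'ambiguity' failed with error: No scenarios processed
FAILURE_RE = re.compile(r"Phase '([^']+)' config '([^']+)' failed with error: (.+)$")


def failed_phase_configs(log_text: str) -> list[tuple[str, str, str]]:
    """Return (phase, config, error) for every failure line in the log."""
    failures = []
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            failures.append((match.group(1), match.group(2), match.group(3).strip()))
    return failures
```

Applied to the log above, this should yield the same seven phase/config pairs that the submission summary reports as failed.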
