Commit 57b962d

courselab: add reference solution script requirement to task structure
1 parent 7ef9288 commit 57b962d

2 files changed: +7 -4 lines changed
benchmarks/courselab_bench/README.md

Lines changed: 6 additions & 3 deletions
@@ -38,6 +38,7 @@ data/course_id/task_id/
 ├── task.md # Problem statement shown to the agent
 ├── compose.yaml # Docker sandbox configuration
 ├── evaluate.sh # Grading script (exit 0 = pass, non-zero = fail)
+├── sol.sh # Reference solution script to validate task solvability
 ├── preprocess.sh # Optional: setup script run before agent starts
 └── starter/ # Optional: starter files copied to workspace
     └── *.py
@@ -47,6 +48,7 @@ data/course_id/task_id/
 - `task.md`: Problem description given to the agent
 - `compose.yaml`: Docker sandbox specification
 - `evaluate.sh`: Runs tests and returns exit code 0 for pass, non-zero for fail
+- `sol.sh`: Reference solution script that validates the task is solvable (executable bash script)
 - `preprocess.sh`: Optional setup script that runs before the agent starts (commonly used with `evaluate.sh` to verify test files remain unchanged via hashing)
 - `starter/`: Optional directory containing starter files that are copied to the workspace with the same relative paths

@@ -59,16 +61,17 @@ See [`data/distributed_systems/task_1_echo_server/`](data/distributed_systems/ta
 3. Add `config.json`, `task.md`, `compose.yaml`, and `evaluate.sh`
 4. Optionally add `starter/` directory with skeleton code
 5. Optionally add `preprocess.sh` for test file integrity checks
-6. Optionally add `sol.sh` as a reference solution script to validate task solvability (executable bash script that implements a working solution)
+6. Add `sol.sh` as a reference solution script to validate task solvability (see note below)
 7. Run tests to validate: `python -m pytest tests/test_data_schema.py -v`

-> Note on Sandboxing: Inspect AI offers multiple sandbox environments (Docker, Kubernetes, local, etc.). For simplicity, and because the majority of tasks won't require more than that, we currently expose a streamlined way to include tasks that use Docker sandboxing via `compose.yaml`. For more information regarding sandboxing and available environments in Inspect AI, see the [Sandboxing documentation](https://inspect.aisi.org.uk/sandboxing.html#environment-binding). If the lab you are adding requires a different sandboxing environment (e.g., Kubernetes), refer to the Inspect AI documentation.
+> **Note on Solution Scripts**: Every task must have a `sol.sh` reference solution to verify solvability. If you prefer not to have your solution publicly available, submit the entire lab to the [private repository](https://github.com/sys-intelligence/system-intelligence-benchmark-private/) instead.
+>
+> **Note on Sandboxing**: Inspect AI offers multiple sandbox environments (Docker, Kubernetes, local, etc.). For simplicity, and because the majority of tasks won't require more than that, we currently expose a streamlined way to include tasks that use Docker sandboxing via `compose.yaml`. For more information regarding sandboxing and available environments in Inspect AI, see the [Sandboxing documentation](https://inspect.aisi.org.uk/sandboxing.html#environment-binding). If the lab you are adding requires a different sandboxing environment (e.g., Kubernetes), refer to the Inspect AI documentation.

 ## Best Practices

 Tasks should be designed carefully and responsibly to ensure they are fair. Here are some best practices:

-- Write a validation script: It's highly recommended to write a reference solution script (`sol.sh`) to verify that there are actions an agent can take to pass the task. This gives confidence that the environment and task are set up correctly.
 - Test with at least one agent: If you test with multiple agents and LLMs and no agent solves the task, it's likely the task needs revision. Review the trajectories to understand whether failures have something in common.
 - Pin git commits for reproducibility: When cloning a git repository, specify the exact commit, then delete the `.git` directory to remove the full history.
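
The README change makes `sol.sh` mandatory so that each task can be shown to be solvable against `evaluate.sh` (exit 0 = pass). The commit does not show how the two scripts are actually run together, so the following is only a minimal local sketch under stated assumptions: the function name `solution_passes` is hypothetical, both scripts are assumed to be invoked with `bash` from a shared workspace directory, and the Docker sandbox that the benchmark uses through Inspect AI is ignored entirely.

```python
import subprocess
from pathlib import Path


def solution_passes(task_dir: Path, workspace: Path) -> bool:
    """Run sol.sh, then evaluate.sh, inside `workspace`.

    Hypothetical helper: returns True only if both scripts exit with code 0.
    Assumes plain local execution with bash, not the real sandboxed harness.
    """
    sol = subprocess.run(
        ["bash", str((task_dir / "sol.sh").resolve())], cwd=workspace
    )
    if sol.returncode != 0:
        # The reference solution itself failed to run.
        return False
    graded = subprocess.run(
        ["bash", str((task_dir / "evaluate.sh").resolve())], cwd=workspace
    )
    # evaluate.sh follows the README contract: exit 0 = pass, non-zero = fail.
    return graded.returncode == 0
```

A task whose reference solution fails this kind of check is likely misconfigured, which is the situation the mandatory `sol.sh` is meant to catch before any agent is run.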

benchmarks/courselab_bench/tests/test_data_schema.py

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ def test_tasks_found(self):

     def test_required_files_exist(self):
         task_folders = get_task_folders(DATA_DIR)
-        required_files = ["config.json", "task.md", "compose.yaml", "evaluate.sh"]
+        required_files = ['config.json', 'task.md', 'compose.yaml', 'evaluate.sh', 'sol.sh']

         for task_folder in task_folders:
             for filename in required_files:
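
The hunk is cut off before the loop body, so the actual assertion is not visible here. As a rough standalone illustration of what such a check amounts to, the sketch below reports which required files a task folder lacks. The helper name `missing_required_files` is hypothetical and not part of the repository; only the list of filenames comes from the diff.

```python
from pathlib import Path

# Filenames taken from the updated required_files list in the diff.
REQUIRED_FILES = ["config.json", "task.md", "compose.yaml", "evaluate.sh", "sol.sh"]


def missing_required_files(task_folder: Path, required=REQUIRED_FILES) -> list:
    """Return the required filenames that are absent from `task_folder`.

    Hypothetical helper for illustration; the real test may assert differently.
    """
    return [name for name in required if not (task_folder / name).is_file()]
```

With this change, a task folder that previously passed with four files would now be reported as missing `sol.sh`.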
