courselab: add reference solution script requirement to task structure

tareknaser · tareknaser · commit 57b962d63bf4 · 2026-02-05T08:31:54.000-08:00
diff --git a/benchmarks/courselab_bench/README.md b/benchmarks/courselab_bench/README.md
@@ -38,6 +38,7 @@ data/course_id/task_id/
 ├── task.md          # Problem statement shown to the agent
 ├── compose.yaml     # Docker sandbox configuration
 ├── evaluate.sh      # Grading script (exit 0 = pass, non-zero = fail)
+├── sol.sh           # Reference solution script to validate task solvability
 ├── preprocess.sh    # Optional: setup script run before agent starts
 └── starter/         # Optional: starter files copied to workspace
     └── *.py
@@ -47,6 +48,7 @@ data/course_id/task_id/
 - `task.md`: Problem description given to the agent
 - `compose.yaml`: Docker sandbox specification
 - `evaluate.sh`: Runs tests and returns exit code 0 for pass, non-zero for fail
+- `sol.sh`: Reference solution, an executable bash script that implements a working solution and thereby shows the task is solvable
 - `preprocess.sh`: Optional setup script that runs before the agent starts (commonly used with `evaluate.sh` to verify test files remain unchanged via hashing)
 - `starter/`: Optional directory containing starter files that are copied to the workspace with the same relative paths
 
@@ -59,16 +61,17 @@ See [`data/distributed_systems/task_1_echo_server/`](data/distributed_systems/ta
 3. Add `config.json`, `task.md`, `compose.yaml`, and `evaluate.sh`
 4. Optionally add `starter/` directory with skeleton code
 5. Optionally add `preprocess.sh` for test file integrity checks
-6. Optionally add `sol.sh` as a reference solution script to validate task solvability (executable bash script that implements a working solution)
+6. Add `sol.sh` as a reference solution script to validate task solvability (see note below)
 7. Run tests to validate: `python -m pytest tests/test_data_schema.py -v`
 
-> Note on Sandboxing: Inspect AI offers multiple sandbox environments (Docker, Kubernetes, local, etc.). For simplicity, and because the majority of tasks won't require more than that, we currently expose a streamlined way to include tasks that use Docker sandboxing via `compose.yaml`. For more information regarding sandboxing and available environments in Inspect AI, see the [Sandboxing documentation](https://inspect.aisi.org.uk/sandboxing.html#environment-binding). If the lab you are adding requires a different sandboxing environment (e.g., Kubernetes), refer to the Inspect AI documentation.
+> **Note on Solution Scripts**: Every task must include a `sol.sh` reference solution so that its solvability can be verified. If you prefer not to publish your solution, submit the entire lab to the [private repository](https://github.com/sys-intelligence/system-intelligence-benchmark-private/) instead.
+>
+> **Note on Sandboxing**: Inspect AI offers multiple sandbox environments (Docker, Kubernetes, local, etc.). For simplicity, and because the majority of tasks won't require more than that, we currently expose a streamlined way to include tasks that use Docker sandboxing via `compose.yaml`. For more information regarding sandboxing and available environments in Inspect AI, see the [Sandboxing documentation](https://inspect.aisi.org.uk/sandboxing.html#environment-binding). If the lab you are adding requires a different sandboxing environment (e.g., Kubernetes), refer to the Inspect AI documentation.
 
 ## Best Practices
 
 Tasks should be designed carefully and responsibly to ensure they are fair. Here are some best practices:
 
-- Write a validation script: It's highly recommended to write a reference solution script (`sol.sh`) to verify that there are actions an agent can take to pass the task. This gives confidence that the environment and task are set up correctly.
 - Test with at least one agent: If you test with multiple agents and LLMs and no agent solves the task, it's likely the task needs revision. Review the trajectories to understand whether failures have something in common.
 - Pin git commits for reproducibility: When cloning a git repository, specify the exact commit, then delete the `.git` directory to remove the full history.
 
diff --git a/benchmarks/courselab_bench/tests/test_data_schema.py b/benchmarks/courselab_bench/tests/test_data_schema.py
@@ -24,7 +24,7 @@ def test_tasks_found(self):
 
     def test_required_files_exist(self):
         task_folders = get_task_folders(DATA_DIR)
-        required_files = ["config.json", "task.md", "compose.yaml", "evaluate.sh"]
+        required_files = ["config.json", "task.md", "compose.yaml", "evaluate.sh", "sol.sh"]
 
         for task_folder in task_folders:
             for filename in required_files: