├── sol.sh        # Reference solution script to validate task solvability
├── preprocess.sh # Optional: setup script run before agent starts
└── starter/      # Optional: starter files copied to workspace
    └── *.py
- `task.md`: Problem description given to the agent
- `compose.yaml`: Docker sandbox specification
- `evaluate.sh`: Runs tests and returns exit code 0 for pass, non-zero for fail
- `sol.sh`: Reference solution script that validates the task is solvable (executable bash script)
- `preprocess.sh`: Optional setup script that runs before the agent starts (commonly used with `evaluate.sh` to verify test files remain unchanged via hashing)
- `starter/`: Optional directory containing starter files that are copied to the workspace with the same relative paths
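The `evaluate.sh` exit-code contract can be sketched as follows. This is a hypothetical illustration, not a real task's script: the success criterion (a `solution.py` file existing) is a placeholder for whatever test suite the task actually runs.

```shell
#!/usr/bin/env bash
# Sketch of the evaluate.sh contract (0 = pass, non-zero = fail).
# The check inside evaluate() is a placeholder -- a real evaluate.sh
# would run the task's test suite, e.g. `python -m pytest tests/ -q`.
set -u

evaluate() {
    local workdir="$1"
    if [ -f "$workdir/solution.py" ]; then   # placeholder success criterion
        echo "PASS"
        return 0
    else
        echo "FAIL"
        return 1
    fi
}

# Demo against a throwaway workspace that already contains a "solution".
workdir="$(mktemp -d)"
touch "$workdir/solution.py"
evaluate "$workdir"; status=$?
echo "exit status: $status"
rm -rf "$workdir"
```

The harness only looks at the exit code, so any language or test runner works as long as the script maps the result onto 0 / non-zero.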
See [`data/distributed_systems/task_1_echo_server/`](data/distributed_systems/task_1_echo_server/).
3. Add `config.json`, `task.md`, `compose.yaml`, and `evaluate.sh`
4. Optionally add `starter/` directory with skeleton code
5. Optionally add `preprocess.sh` for test file integrity checks
6. Add `sol.sh` as a reference solution script to validate task solvability (see note below)
7. Run tests to validate: `python -m pytest tests/test_data_schema.py -v`
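The `preprocess.sh` integrity check mentioned in the steps above can be sketched as a hashing handshake. File names and the hash-file location are hypothetical; `preprocess.sh` records hashes before the agent starts, and `evaluate.sh` later re-checks them to detect tampering with the test files.

```shell
#!/usr/bin/env bash
# Sketch of the preprocess.sh / evaluate.sh hashing handshake.
# File names and hash-file location are hypothetical placeholders.
set -u

demo="$(mktemp -d)"
echo "assert 1 + 1 == 2" > "$demo/test_sample.py"   # stand-in test file
hashfile="$demo/.test_hashes.sha256"  # in practice, store outside the agent workspace

# preprocess.sh step: snapshot hashes of the test files before the agent starts
(cd "$demo" && sha256sum test_*.py) > "$hashfile"

# evaluate.sh step: verify the test files are unchanged (--status: exit code only)
if (cd "$demo" && sha256sum -c "$hashfile" --status); then
    verdict="unchanged"
else
    verdict="modified"
fi
echo "test files: $verdict"
rm -rf "$demo"
```

If any recorded hash no longer matches, `sha256sum -c` exits non-zero, and `evaluate.sh` can fail the task before running any tests.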
> **Note on Solution Scripts**: Every task must have a `sol.sh` reference solution to verify solvability. If you prefer not to have your solution publicly available, submit the entire lab to the [private repository](https://github.com/sys-intelligence/system-intelligence-benchmark-private/) instead.
>
> **Note on Sandboxing**: Inspect AI offers multiple sandbox environments (Docker, Kubernetes, local, etc.). For simplicity, and because the majority of tasks won't require more than that, we currently expose a streamlined way to include tasks that use Docker sandboxing via `compose.yaml`. For more information regarding sandboxing and available environments in Inspect AI, see the [Sandboxing documentation](https://inspect.aisi.org.uk/sandboxing.html#environment-binding). If the lab you are adding requires a different sandboxing environment (e.g., Kubernetes), refer to the Inspect AI documentation.
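A minimal `sol.sh` might look like the sketch below. The workspace-path convention and the solution contents are hypothetical placeholders: the point is that the script performs the same actions an agent would, leaving the workspace in a state that `evaluate.sh` accepts.

```shell
#!/usr/bin/env bash
# Sketch of a sol.sh reference solution. The workspace path convention
# (first argument) and the file contents are hypothetical placeholders.
set -eu

workspace="${1:-$(mktemp -d)}"   # assumed: workspace passed as first argument

# Write a known-good solution; a real sol.sh implements the full task.
cat > "$workspace/solution.py" <<'EOF'
def echo(message: str) -> str:
    # Trivial placeholder standing in for a working solution.
    return message
EOF

echo "reference solution written to $workspace"
```

Running `sol.sh` followed by `evaluate.sh` during review is a quick end-to-end check that the task is actually solvable.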
## Best Practices
Tasks should be designed carefully and responsibly to ensure they are fair. Here are some best practices:
- Write a validation script: Every task must ship a reference solution script (`sol.sh`) that demonstrates there are actions an agent can take to pass the task. This gives confidence that the environment and task are set up correctly.
- Test with at least one agent: If you test with multiple agents and LLMs and no agent solves the task, it's likely the task needs revision. Review the trajectories to understand whether failures have something in common.
- Pin git commits for reproducibility: When cloning a git repository, specify the exact commit, then delete the `.git` directory to remove the full history.
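The pinning step can be sketched end to end. The demo below builds a throwaway local repository so it is runnable anywhere; a real task would clone the upstream URL instead.

```shell
#!/usr/bin/env bash
# Demo of pinning a git commit for reproducibility: clone, check out the
# exact commit, then delete .git so only the pinned snapshot remains.
# A throwaway local repo stands in for the real upstream URL.
set -eu

tmp="$(mktemp -d)"
git init -q "$tmp/upstream"
git -C "$tmp/upstream" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "pinned state"
sha="$(git -C "$tmp/upstream" rev-parse HEAD)"

git clone -q "$tmp/upstream" "$tmp/task_repo"
git -C "$tmp/task_repo" checkout -q "$sha"   # pin the exact commit
rm -rf "$tmp/task_repo/.git"                 # strip the full history

if [ -d "$tmp/task_repo/.git" ]; then stripped=no; else stripped=yes; fi
echo "history stripped: $stripped"
rm -rf "$tmp"
```

Deleting `.git` keeps the task directory small and prevents the agent from recovering newer upstream code (or the solution) from the repository history.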