Scout includes a reproducible eval harness that runs the agent against a snapshot dataset stored in Opik, so you can measure quality changes without hitting live GitHub issues on every run.

## How it works

Each dataset item uses a `scenario`/`spec`/`target_issue` shape. The `spec` is handed to `providers/scenarios.build()`, which constructs a `GitHubSimulator`. Issues and writes are always simulated. File reads can be fully simulated (include a `files` key in the spec) or delegated to real GitHub (omit `files`; this requires `GITHUB_TOKEN`).
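
As a rough illustration of the rule above, the sketch below shows how the presence of a `files` key could select the file backend. `file_mode` is a hypothetical helper written for this example; it is not part of `providers/scenarios`, and the real `build()` may make this decision differently.

```python
from typing import Any, Mapping


def file_mode(spec: Mapping[str, Any]) -> str:
    """Return which file backend a spec selects (hypothetical helper)."""
    # A `files` key means file reads are served from the spec itself.
    if "files" in spec:
        return "simulated"
    # Without `files`, reads are delegated to real GitHub (needs GITHUB_TOKEN).
    return "real-github"


simulated_spec = {"owner": "my-org", "name": "my-repo", "files": {"src/foo.py": "..."}}
live_spec = {"owner": "my-org", "name": "my-repo"}

print(file_mode(simulated_spec))  # simulated
print(file_mode(live_spec))       # real-github
```

In both modes issues and writes stay simulated; only file reads change.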
## Setup

Add these to your `.env` alongside the standard Scout config (see `.env.example` for all variables):

| Var | Default | Description |
|---|---|---|
| `SCOUT_GITHUB_DATASET_NAME` | `scout-triage-inputs` | Opik dataset of real GitHub issues |
| `SCOUT_STARTER_DATASET_NAME` | `scout-starter-scenarios` | Opik dataset of synthetic starter scenarios |
| `SCOUT_EVAL_OPIK_PROJECT` | `scout-eval` | Opik project for eval traces |
| `SCOUT_EXPERIMENT_NAME` | `scout-offline-eval` | Experiment name prefix (timestamp appended per run) |
| `SCOUT_GITHUB_REPO_OWNER` | (none) | Required for real-GitHub file mode |
| `SCOUT_GITHUB_REPO_NAME` | (none) | Required for real-GitHub file mode |
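
The sketch below shows one way these variables and their defaults could be resolved, and how a timestamp might be appended to the experiment prefix. `load_eval_config` and `experiment_name` are hypothetical names, and the timestamp format is an assumption; the harness may use a different layout.

```python
import os
from datetime import datetime, timezone
from typing import Mapping, Optional

# Defaults copied from the table above.
DEFAULTS = {
    "SCOUT_GITHUB_DATASET_NAME": "scout-triage-inputs",
    "SCOUT_STARTER_DATASET_NAME": "scout-starter-scenarios",
    "SCOUT_EVAL_OPIK_PROJECT": "scout-eval",
    "SCOUT_EXPERIMENT_NAME": "scout-offline-eval",
}
# No default: only needed when file reads go to real GitHub.
NO_DEFAULT = ("SCOUT_GITHUB_REPO_OWNER", "SCOUT_GITHUB_REPO_NAME")


def load_eval_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve eval settings from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    config = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    for key in NO_DEFAULT:
        config[key] = env.get(key)  # None when unset
    return config


def experiment_name(prefix: str, now: Optional[datetime] = None) -> str:
    """Append a UTC timestamp to the prefix (format is an assumption)."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}-{now.strftime('%Y%m%d-%H%M%S')}"


config = load_eval_config()
print(experiment_name(config["SCOUT_EXPERIMENT_NAME"]))
```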
## Datasets

Two separate Opik datasets are used:

| Dataset | Contents | Seeded from |
|---|---|---|
| `scout-triage-inputs` | Real issues from the GitHub repo | `fetch_github_issues.py` → `seed_dataset.py --from-github` |

Every item in an Opik dataset must follow this shape:
```json
{
  "data": {
    "scenario": "default",
    "spec": {
      "owner": "my-org",
      "name": "my-repo",
      "readme": "...",
      "files": {"src/foo.py": "..."},
      "issues": [
        {
          "number": 42,
          "title": "...",
          "body": "...",
          "state": "open",
          "author": "alice",
          "labels": [],
          "comments": []
        }
      ]
    },
    "target_issue": 42
  }
}
```

Omit `files` (and `readme`) inside `spec` to use real-GitHub mode. In that mode, `list_directory`, `get_file_contents`, and `fetch_readme` read from the live repo using `GITHUB_TOKEN`.
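
Before seeding, it can help to sanity-check items against this shape. The validator below is a minimal sketch and not part of Scout. `validate_item` is a hypothetical name, and the check that `target_issue` matches one of the issues in `spec.issues` is an assumption based on the example, not a documented requirement.

```python
from typing import Any, Dict, List


def validate_item(item: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a dataset item (empty if none)."""
    data = item.get("data")
    if not isinstance(data, dict):
        return ["missing 'data' object"]

    problems = []
    for key in ("scenario", "spec", "target_issue"):
        if key not in data:
            problems.append(f"missing data.{key}")

    # Assumption: target_issue should refer to an issue defined in the spec.
    spec = data.get("spec") if isinstance(data.get("spec"), dict) else {}
    numbers = {issue.get("number") for issue in spec.get("issues", [])}
    if "target_issue" in data and data["target_issue"] not in numbers:
        problems.append("target_issue does not match any issue in spec.issues")
    return problems


item = {
    "data": {
        "scenario": "default",
        "spec": {"owner": "my-org", "name": "my-repo", "issues": [{"number": 42}]},
        "target_issue": 42,
    }
}
print(validate_item(item))  # []
```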
## Step 1: Fetch issues from GitHub

`evals/utils/fetch_github_issues.py` fetches real issues from any GitHub repo and saves them as JSON. It uses the GitHub search API with `is:issue` to exclude pull requests.
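
To show what an `is:issue` search looks like, the sketch below builds a request URL for GitHub's `/search/issues` endpoint. It only constructs the URL and makes no network call. `build_search_url` is a hypothetical helper, and the exact qualifiers the script sends may differ.

```python
from urllib.parse import urlencode


def build_search_url(owner: str, name: str, state: str = "open", count: int = 10) -> str:
    """Build a GitHub issue-search URL that excludes pull requests."""
    # `is:issue` restricts results to issues; PRs would need `is:pull-request`.
    qualifiers = [f"repo:{owner}/{name}", "is:issue"]
    if state in ("open", "closed"):
        qualifiers.append(f"state:{state}")
    elif state != "all":
        raise ValueError(f"unsupported state: {state!r}")

    # The search API returns at most 100 results per page.
    params = {"q": " ".join(qualifiers), "per_page": min(count, 100)}
    return "https://api.github.com/search/issues?" + urlencode(params)


print(build_search_url("my-org", "my-repo", state="all", count=30))
```

With `state="all"` no state qualifier is added, so both open and closed issues match.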
```bash
python evals/utils/fetch_github_issues.py --count 30 --state all --out github_issues.json
```

Options:

- `--repo owner/name`: override the repo (defaults to `SCOUT_GITHUB_REPO_OWNER`/`SCOUT_GITHUB_REPO_NAME`)
- `--count N`: number of issues to fetch (default: 10)
- `--state open|closed|all`: issue state filter (default: open)

Each script creates the dataset if it doesn't exist, or appends to an existing one. To start clean, delete the dataset in the Opik UI before re-seeding.
## Step 3: Run the eval

```bash
python evals/run_offline_eval.py
```

By default this runs against `scout-triage-inputs`. Set `SCOUT_GITHUB_DATASET_NAME` in `.env` to target a different dataset.

Results and traces are logged to Opik under the project set in `SCOUT_EVAL_OPIK_PROJECT`. Each run gets a unique timestamped experiment name so results are easy to compare across runs.

The task returns `output` (the comment Scout posted), `final_labels`, `applied_labels`, and `search_queries`, so scoring metrics can reference side-effect state, not just the comment text.
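
As an example of scoring on side-effect state, the sketch below computes label precision and recall from a task result. It is a plain function, not an Opik metric class. It assumes `applied_labels` is a list of label strings, and `expected_labels` is a hypothetical reference list that you would supply yourself.

```python
from typing import Any, Dict, List


def label_scores(task_output: Dict[str, Any], expected_labels: List[str]) -> Dict[str, float]:
    """Compare the labels Scout applied against a reference set."""
    applied = set(task_output.get("applied_labels", []))
    expected = set(expected_labels)
    hits = len(applied & expected)

    # With nothing applied, precision is perfect only if nothing was expected.
    if applied:
        precision = hits / len(applied)
    else:
        precision = 1.0 if not expected else 0.0
    recall = hits / len(expected) if expected else 1.0
    return {"precision": precision, "recall": recall}


result = {"output": "Thanks for the report!", "applied_labels": ["bug", "needs-repro"]}
print(label_scores(result, expected_labels=["bug"]))
# {'precision': 0.5, 'recall': 1.0}
```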