
Commit 4d15516

Merge pull request #29 from comet-ml/offline-evals-onto-main
feat: add offline eval harness and issue dataset tooling
2 parents d1bac82 + b0216d4 commit 4d15516

6 files changed

Lines changed: 558 additions & 1 deletion


.env.example

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
1+
# ── Core ──────────────────────────────────────────────────────────────────────
2+
3+
# Anthropic API key (required)
4+
ANTHROPIC_API_KEY=
5+
6+
# GitHub personal access token — required for real-GitHub file/search mode and
7+
# for the production triage action
8+
GITHUB_TOKEN=
9+
10+
# Target repository (required for evals and local runs; set automatically in
11+
# GitHub Actions via GITHUB_REPOSITORY)
12+
SCOUT_GITHUB_REPO_OWNER=
13+
SCOUT_GITHUB_REPO_NAME=
14+
15+
# Issue number to triage (set by the GitHub Actions workflow at runtime)
16+
ISSUE_NUMBER=
17+
18+
19+
# ── Opik ──────────────────────────────────────────────────────────────────────
20+
21+
OPIK_API_KEY=
22+
OPIK_WORKSPACE=
23+
24+
# Opik project all eval traces and datasets are grouped under
25+
SCOUT_EVAL_OPIK_PROJECT=scout:comet-ml/scout-test-repo
26+
27+
28+
# ── Scout agent ───────────────────────────────────────────────────────────────
29+
30+
# Label applied to escalated issues (default: "Escalated request")
31+
SCOUT_ESCALATION_TAG=Escalated request
32+
33+
# Opik prompt name and optional version pin for the system prompt
34+
# Omit version to always use the latest published version
35+
SCOUT_OPIK_PROMPT_NAME=
36+
# SCOUT_OPIK_PROMPT_VERSION=
37+
38+
# Override the system prompt inline or point to a local file (both optional)
39+
# SCOUT_SYSTEM_PROMPT=
40+
# SCOUT_PROMPT_FILE=
41+
42+
# Model and token limit overrides (optional)
43+
# SCOUT_MODEL=
44+
# SCOUT_MAX_TOKENS=
45+
46+
47+
# ── Datasets ──────────────────────────────────────────────────────────────────
48+
49+
# Opik dataset seeded from real GitHub issues via fetch_github_issues.py
50+
SCOUT_GITHUB_DATASET_NAME=scout-triage-inputs
51+
52+
# Opik dataset seeded from synthetic starter scenarios
53+
SCOUT_STARTER_DATASET_NAME=scout-starter-scenarios
54+
55+
# Experiment name prefix used by run_offline_eval.py (timestamp appended per run)
56+
# SCOUT_EXPERIMENT_NAME=scout-offline-eval
57+
58+
59+
# ── Feedback sync ─────────────────────────────────────────────────────────────
60+
61+
# How many days back to scan for Scout comment reactions (default: 7)
62+
# SCOUT_FEEDBACK_SINCE_DAYS=7
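
As a rough illustration of how a few of these variables might be consumed, the sketch below reads them with the defaults stated in the comments above. The `load_settings` helper and its return shape are hypothetical and not part of Scout; only the variable names, the "required" note on `ANTHROPIC_API_KEY`, and the default values are taken from this file.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper: read a few Scout variables with their documented defaults."""
    api_key = env.get("ANTHROPIC_API_KEY", "")
    if not api_key:
        # .env.example marks this key as required.
        raise RuntimeError("ANTHROPIC_API_KEY is required")
    return {
        "anthropic_api_key": api_key,
        # Default taken from the comment in .env.example.
        "escalation_tag": env.get("SCOUT_ESCALATION_TAG", "Escalated request"),
        # Default of 7 days taken from the comment in .env.example.
        "feedback_since_days": int(env.get("SCOUT_FEEDBACK_SINCE_DAYS", "7")),
    }
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the real environment.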

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -8,4 +8,6 @@ __pycache__/
8 8
# Packaging artifacts (pip install -e . / python -m build)
9 9
build/
10 10
dist/
11-
*.egg-info/
11+
*.egg-info/
12+
*.csv
13+
*.json

evals/README.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
1+
# Offline evaluation
2+
3+
Scout includes a reproducible eval harness that runs the agent against a snapshot dataset stored in Opik, so you can measure quality changes without hitting live GitHub issues on every run.
4+
5+
## How it works
6+
7+
Each dataset item uses a `scenario`/`spec`/`target_issue` shape. The `spec` is handed to `providers/scenarios.build()`, which constructs a `GitHubSimulator` — issues and writes are always simulated; file reads can be fully simulated (include a `files` key in the spec) or delegated to real GitHub (omit `files`, requires `GITHUB_TOKEN`).
8+
9+
## Setup
10+
11+
Add these to your `.env` alongside the standard Scout config (see `.env.example` for all variables):
12+
13+
| Var | Default | Description |
14+
|---|---|---|
15+
| `SCOUT_GITHUB_DATASET_NAME` | `scout-triage-inputs` | Opik dataset of real GitHub issues |
16+
| `SCOUT_STARTER_DATASET_NAME` | `scout-starter-scenarios` | Opik dataset of synthetic starter scenarios |
17+
| `SCOUT_EVAL_OPIK_PROJECT` | `scout-eval` | Opik project for eval traces |
18+
| `SCOUT_EXPERIMENT_NAME` | `scout-offline-eval` | Experiment name prefix (timestamp appended per run) |
19+
| `SCOUT_GITHUB_REPO_OWNER` | (none) | Required for real-GitHub file mode |
20+
| `SCOUT_GITHUB_REPO_NAME` | (none) | Required for real-GitHub file mode |
21+
22+
## Datasets
23+
24+
Two separate Opik datasets are used:
25+
26+
| Dataset | Contents | Seeded from |
27+
|---|---|---|
28+
| `scout-triage-inputs` | Real issues from the GitHub repo | `fetch_github_issues.py` → `seed_dataset.py --from-github` |
29+
| `scout-starter-scenarios` | Synthetic fully-simulated scenarios | `seed_dataset.py --from-starter` |
30+
31+
## Dataset item format
32+
33+
Every item in an Opik dataset must follow this shape:
34+
35+
```json
{
  "data": {
    "scenario": "default",
    "spec": {
      "owner": "my-org",
      "name": "my-repo",
      "readme": "...",
      "files": {"src/foo.py": "..."},
      "issues": [
        {
          "number": 42,
          "title": "...",
          "body": "...",
          "state": "open",
          "author": "alice",
          "labels": [],
          "comments": []
        }
      ]
    },
    "target_issue": 42
  }
}
```
60+
61+
Omit `files` (and `readme`) inside `spec` to use real-GitHub mode: `list_directory`, `get_file_contents`, and `fetch_readme` then read from the live repo using `GITHUB_TOKEN`.
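
To make the item shape and the two file modes concrete, here is a minimal sketch that checks an item for the required keys and reports which mode it would select. The `file_mode` function and the mode labels `"simulated"` and `"real-github"` are illustrative names, not part of Scout; the key checks mirror the format described above.

```python
def file_mode(item: dict) -> str:
    """Hypothetical check: validate a dataset item and report its file mode."""
    # Items may arrive wrapped in a "data" key or already unwrapped.
    data = item.get("data", item)
    missing = [key for key in ("spec", "target_issue") if key not in data]
    if missing:
        raise ValueError(f"Dataset item is missing {missing}; got keys {list(data)}")
    # A "files" key in the spec means file reads are fully simulated;
    # without it, file reads are delegated to real GitHub.
    return "simulated" if "files" in data["spec"] else "real-github"
```

Running a check like this over a JSON file before seeding can catch malformed items early, before an eval run fails on them.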
62+
63+
## Step 1 — Fetch issues from GitHub
64+
65+
`evals/utils/fetch_github_issues.py` fetches real issues from any GitHub repo and saves them as JSON. It uses the GitHub search API with `is:issue` to exclude pull requests.
66+
67+
```bash
68+
python evals/utils/fetch_github_issues.py --count 30 --state all --out github_issues.json
69+
```
70+
71+
Options:
72+
- `--repo owner/name` — override the repo (defaults to `SCOUT_GITHUB_REPO_OWNER`/`SCOUT_GITHUB_REPO_NAME`)
73+
- `--count N` — number of issues to fetch (default: 10)
74+
- `--state open|closed|all` — issue state filter (default: open)
75+
- `--out FILE` — output path (default: `github_issues.json`)
76+
77+
> JSON output files are gitignored — don't commit them.
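
The `is:issue` filter mentioned above can be sketched as a search query builder. This is an assumption about how such a query could be assembled, not the script's actual implementation: `build_issue_query` is a hypothetical name, and the resulting string is the kind of value that would be passed as the `q` parameter to GitHub's issue search endpoint.

```python
def build_issue_query(owner: str, name: str, state: str = "open") -> str:
    """Hypothetical builder for a GitHub issue search query string."""
    # "is:issue" keeps pull requests out of the results.
    parts = [f"repo:{owner}/{name}", "is:issue"]
    if state in ("open", "closed"):
        parts.append(f"state:{state}")
    elif state != "all":
        raise ValueError(f"unsupported state: {state!r}")
    # For "all", no state qualifier is added, so both states match.
    return " ".join(parts)
```

For example, `build_issue_query("my-org", "my-repo", "all")` yields a query with only the `repo:` and `is:issue` qualifiers.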
78+
79+
## Step 2 — Seed the datasets
80+
81+
`evals/utils/seed_dataset.py` inserts items into Opik datasets. The two sources seed separate datasets:
82+
83+
**Real GitHub issues** → `scout-triage-inputs`:
84+
85+
```bash
86+
python evals/utils/seed_dataset.py --from-github github_issues.json
87+
```
88+
89+
**Synthetic starter scenarios** → `scout-starter-scenarios`:
90+
91+
```bash
92+
python evals/utils/seed_dataset.py --from-starter
93+
```
94+
95+
**Both at once:**
96+
97+
```bash
98+
python evals/utils/seed_dataset.py --from-github github_issues.json --from-starter
99+
```
100+
101+
Each run creates its target dataset if it doesn't exist, or appends to it if it does. To start clean, delete the dataset in the Opik UI before re-seeding.
102+
103+
## Step 3 — Run the eval
104+
105+
```bash
106+
python evals/run_offline_eval.py
107+
```
108+
109+
By default this runs against `scout-triage-inputs`. Set `SCOUT_GITHUB_DATASET_NAME` in `.env` to target a different dataset.
110+
111+
Results and traces are logged to Opik under the project set in `SCOUT_EVAL_OPIK_PROJECT`. Each run gets a unique timestamped experiment name so results are easy to compare across runs.
112+
113+
The task returns `output` (the comment Scout posted), `final_labels`, `applied_labels`, and `search_queries`, so scoring metrics can reference side-effect state, not just the comment text.
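
As one possible way to score on side-effect state, the sketch below checks whether the escalation label appears in `applied_labels`. It is a plain function over the task's return dict, not an Opik metric class; the name `escalation_applied` is hypothetical, and wiring it into `scoring_metrics` is left out.

```python
def escalation_applied(task_output: dict, escalation_tag: str = "Escalated request") -> float:
    """Hypothetical score: 1.0 if the escalation label was applied, else 0.0."""
    # "applied_labels" is one of the keys the eval task returns.
    applied = task_output.get("applied_labels", [])
    return 1.0 if escalation_tag in applied else 0.0
```

A score of this kind depends only on what the agent did to the simulated issue, so it stays stable even when the comment wording changes between runs.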

evals/run_offline_eval.py

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
#!/usr/bin/env python3
"""Run Scout offline against an Opik dataset of triage scenarios.

Each dataset item must use the simulator format:
    {
        "scenario": "default",  # builder name from providers/scenarios.py
        "spec": {
            "owner": "...", "name": "...",
            "readme": "...",  # optional
            "files": {"src/foo.py": "..."},  # omit for real-GitHub file access
            "issues": [...]
        },
        "target_issue": 42
    }

Env vars (on top of the normal Scout config in .env):
    SCOUT_GITHUB_DATASET_NAME — Opik dataset to evaluate (default: "scout-triage-inputs")
    SCOUT_EVAL_OPIK_PROJECT — Opik project for traces (default: "scout-eval")
    SCOUT_EXPERIMENT_NAME — prefix for the experiment name (default: "scout-offline-eval")
"""
from __future__ import annotations

import logging
import os
import sys
# scout.triage calls _get_issue_number() at module level; give it a dummy value so
# the import succeeds — the eval never uses ISSUE_NUMBER from triage directly.
os.environ.setdefault("ISSUE_NUMBER", "1")

# Resolve the repo root so relative imports work when run from the evals/ dir.
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from datetime import datetime

import opik
from opik.evaluation.metrics import Usefulness
from dotenv import load_dotenv

load_dotenv(override=True)

from scout.agent import make_client, run_agent  # noqa: E402
from scout.providers.scenarios import build  # noqa: E402
from scout.triage import (  # noqa: E402
    ANTHROPIC_API_KEY,
    MAX_TOKENS,
    MODEL,
    SCOUT_ESCALATION_TAG,
    SCOUT_OPIK_PROMPT_NAME,
    load_system_prompt,
)

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger(__name__)

DATASET_NAME = os.environ.get("SCOUT_GITHUB_DATASET_NAME", "scout-triage-inputs")
EVAL_OPIK_PROJECT = os.environ.get("SCOUT_EVAL_OPIK_PROJECT", "scout-eval")
EXPERIMENT_NAME_PREFIX = os.environ.get("SCOUT_EXPERIMENT_NAME", "scout-offline-eval")


def _experiment_name() -> str:
    return f"{EXPERIMENT_NAME_PREFIX}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"


def make_eval_task(system_prompt: str):
    client = make_client(ANTHROPIC_API_KEY, opik_project=EVAL_OPIK_PROJECT)

    def eval_task(item: dict) -> dict:
        data = item.get("data", item)
        if "spec" not in data or "target_issue" not in data:
            keys = list(data.keys())
            raise ValueError(
                f"Dataset item is missing 'spec' or 'target_issue'. Got keys: {keys}. "
                "Re-seed the dataset using evals/utils/seed_offline_dataset.py — "
                "old CSV-format rows must be removed first (delete the dataset in the Opik UI)."
            )
        scenario = data.get("scenario", "default")
        spec = data["spec"]
        target = int(data["target_issue"])

        sim = build(scenario, spec)
        logger.info("scenario=%s target=#%d", scenario, target)

        comment, _trace_id = run_agent(
            sim,
            target,
            client=client,
            system_prompt=system_prompt,
            escalation_tag=SCOUT_ESCALATION_TAG,
            repo_owner=sim.owner,
            repo_name=sim.name,
            opik_project=EVAL_OPIK_PROJECT,
            model=MODEL,
            max_tokens=MAX_TOKENS,
        )

        final_issue = sim.issue(target)
        applied = [c[2] for c in sim.calls if len(c) == 3 and c[0] == "apply_label"]
        searches = [c[1] for c in sim.calls if len(c) == 2 and c[0] == "search_issues"]

        return {
            "input": {"target_issue": target, "issues": spec.get("issues", [])},
            "output": comment,
            "final_labels": final_issue["labels"],
            "applied_labels": applied,
            "search_queries": searches,
        }

    return eval_task


def main() -> None:
    opik_client = opik.Opik()
    dataset = opik_client.get_dataset(DATASET_NAME)
    system_prompt = load_system_prompt()
    prompt_obj = opik_client.get_chat_prompt(name=SCOUT_OPIK_PROMPT_NAME)

    experiment_name = _experiment_name()
    logger.info("Dataset: %s | Experiment: %s", DATASET_NAME, experiment_name)

    opik.evaluate(
        dataset=dataset,
        task=make_eval_task(system_prompt),
        experiment_name=experiment_name,
        project_name=EVAL_OPIK_PROJECT,
        experiment_config={"model": MODEL, "max_tokens": MAX_TOKENS, "prompt_name": SCOUT_OPIK_PROMPT_NAME},
        prompts=[prompt_obj] if prompt_obj else [],
        scoring_metrics=[Usefulness(model="claude-sonnet-4-6")],
    )

    logger.info("Done — view results in the Opik UI under project '%s'", EVAL_OPIK_PROJECT)


if __name__ == "__main__":
    main()
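
The two list comprehensions over `sim.calls` in `eval_task` imply a call log of tuples keyed by tool name. The sketch below replays that filtering on a hand-written log. The tuple layouts are an assumption inferred from the indexing (`("apply_label", <issue>, <label>)` and `("search_issues", <query>)`), and the `post_comment` entry and `summarize_calls` name are purely illustrative.

```python
def summarize_calls(calls: list[tuple]) -> tuple[list, list]:
    """Hypothetical helper mirroring the filtering done in eval_task."""
    # Three-element apply_label entries: the label is the last element.
    applied = [c[2] for c in calls if len(c) == 3 and c[0] == "apply_label"]
    # Two-element search_issues entries: the query is the second element.
    searches = [c[1] for c in calls if len(c) == 2 and c[0] == "search_issues"]
    return applied, searches


# Hand-written example log; entry shapes are assumed, not taken from the simulator.
example_calls = [
    ("search_issues", "login crash"),
    ("apply_label", 42, "bug"),
    ("post_comment", 42, "Thanks for the report"),
]
applied_labels, search_queries = summarize_calls(example_calls)
```

Entries with other names or lengths are ignored, which is why the length check sits alongside the name check.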
