Merge pull request #29 from comet-ml/offline-evals-onto-main

LeoRoccoBreedt · web-flow · commit 4d1551697e8f · 2026-06-16T20:58:30.000+02:00
feat: add offline eval harness and issue dataset tooling
diff --git a/.env.example b/.env.example
@@ -0,0 +1,62 @@
+# ── Core ──────────────────────────────────────────────────────────────────────
+
+# Anthropic API key (required)
+ANTHROPIC_API_KEY=
+
+# GitHub personal access token — required for real-GitHub file/search mode and
+# for the production triage action
+GITHUB_TOKEN=
+
+# Target repository (required for evals and local runs; set automatically in
+# GitHub Actions via GITHUB_REPOSITORY)
+SCOUT_GITHUB_REPO_OWNER=
+SCOUT_GITHUB_REPO_NAME=
+
+# Issue number to triage (set by the GitHub Actions workflow at runtime)
+ISSUE_NUMBER=
+
+
+# ── Opik ──────────────────────────────────────────────────────────────────────
+
+OPIK_API_KEY=
+OPIK_WORKSPACE=
+
+# Opik project under which all eval traces and datasets are grouped
+SCOUT_EVAL_OPIK_PROJECT=scout:comet-ml/scout-test-repo
+
+
+# ── Scout agent ───────────────────────────────────────────────────────────────
+
+# Label applied to escalated issues (default: "Escalated request")
+SCOUT_ESCALATION_TAG=Escalated request
+
+# Opik prompt name and optional version pin for the system prompt
+# Omit version to always use the latest published version
+SCOUT_OPIK_PROMPT_NAME=
+# SCOUT_OPIK_PROMPT_VERSION=
+
+# Override the system prompt inline or point to a local file (both optional)
+# SCOUT_SYSTEM_PROMPT=
+# SCOUT_PROMPT_FILE=
+
+# Model and token limit overrides (optional)
+# SCOUT_MODEL=
+# SCOUT_MAX_TOKENS=
+
+
+# ── Datasets ──────────────────────────────────────────────────────────────────
+
+# Opik dataset seeded from real GitHub issues via fetch_github_issues.py
+SCOUT_GITHUB_DATASET_NAME=scout-triage-inputs
+
+# Opik dataset seeded from synthetic starter scenarios
+SCOUT_STARTER_DATASET_NAME=scout-starter-scenarios
+
+# Experiment name prefix used by run_offline_eval.py (timestamp appended per run)
+# SCOUT_EXPERIMENT_NAME=scout-offline-eval
+
+
+# ── Feedback sync ─────────────────────────────────────────────────────────────
+
+# How many days back to scan for Scout comment reactions (default: 7)
+# SCOUT_FEEDBACK_SINCE_DAYS=7
diff --git a/.gitignore b/.gitignore
@@ -8,4 +8,6 @@ __pycache__/
 # Packaging artifacts (pip install -e . / python -m build)
 build/
 dist/
-*.egg-info/
+*.egg-info/
+*.csv
+*.json
diff --git a/evals/README.md b/evals/README.md
@@ -0,0 +1,113 @@
+# Offline evaluation
+
+Scout includes a reproducible eval harness that runs the agent against a snapshot dataset stored in Opik, so you can measure quality changes without hitting live GitHub issues on every run.
+
+## How it works
+
+Each dataset item uses a `scenario`/`spec`/`target_issue` shape. The `spec` is passed to `build()` in `scout/providers/scenarios.py`, which constructs a `GitHubSimulator`. Issues and writes are always simulated; file reads can be fully simulated (include a `files` key in the spec) or delegated to real GitHub (omit `files`; requires `GITHUB_TOKEN`).
+
+## Setup
+
+Add these to your `.env` alongside the standard Scout config (see `.env.example` for all variables):
+
+| Var | Default | Description |
+|---|---|---|
+| `SCOUT_GITHUB_DATASET_NAME` | `scout-triage-inputs` | Opik dataset of real GitHub issues |
+| `SCOUT_STARTER_DATASET_NAME` | `scout-starter-scenarios` | Opik dataset of synthetic starter scenarios |
+| `SCOUT_EVAL_OPIK_PROJECT` | `scout-eval` | Opik project for eval traces |
+| `SCOUT_EXPERIMENT_NAME` | `scout-offline-eval` | Experiment name prefix (timestamp appended per run) |
+| `SCOUT_GITHUB_REPO_OWNER` | — | Required for real-GitHub file mode |
+| `SCOUT_GITHUB_REPO_NAME` | — | Required for real-GitHub file mode |
+
+## Datasets
+
+Two separate Opik datasets are used:
+
+| Dataset | Contents | Seeded from |
+|---|---|---|
+| `scout-triage-inputs` | Real issues from the GitHub repo | `fetch_github_issues.py` → `seed_dataset.py --from-github` |
+| `scout-starter-scenarios` | Synthetic fully-simulated scenarios | `seed_dataset.py --from-starter` |
+
+## Dataset item format
+
+Every item in an Opik dataset must follow this shape:
+
+```json
+{
+  "data": {
+    "scenario": "default",
+    "spec": {
+      "owner": "my-org",
+      "name": "my-repo",
+      "readme": "...",
+      "files": {"src/foo.py": "..."},
+      "issues": [
+        {
+          "number": 42,
+          "title": "...",
+          "body": "...",
+          "state": "open",
+          "author": "alice",
+          "labels": [],
+          "comments": []
+        }
+      ]
+    },
+    "target_issue": 42
+  }
+}
+```
+
+Omit `files` (and `readme`) inside `spec` to use real-GitHub mode: `list_directory`, `get_file_contents`, and `fetch_readme` calls are then served from the live repo, authenticated with `GITHUB_TOKEN`.
+
+## Step 1 — Fetch issues from GitHub
+
+`evals/utils/fetch_github_issues.py` fetches real issues from any GitHub repo and saves them as JSON. It uses the GitHub search API with `is:issue` to exclude pull requests.
+
+```bash
+python evals/utils/fetch_github_issues.py --count 30 --state all --out github_issues.json
+```
+
+Options:
+- `--repo owner/name` — override the repo (defaults to `SCOUT_GITHUB_REPO_OWNER`/`SCOUT_GITHUB_REPO_NAME`)
+- `--count N` — number of issues to fetch (default: 10)
+- `--state open|closed|all` — issue state filter (default: open)
+- `--out FILE` — output path (default: `github_issues.json`)
+
+> JSON output files are gitignored — don't commit them.
+
+## Step 2 — Seed the datasets
+
+`evals/utils/seed_dataset.py` inserts items into Opik datasets. The two sources seed separate datasets:
+
+**Real GitHub issues** → `scout-triage-inputs`:
+
+```bash
+python evals/utils/seed_dataset.py --from-github github_issues.json
+```
+
+**Synthetic starter scenarios** → `scout-starter-scenarios`:
+
+```bash
+python evals/utils/seed_dataset.py --from-starter
+```
+
+**Both at once:**
+
+```bash
+python evals/utils/seed_dataset.py --from-github github_issues.json --from-starter
+```
+
+The script creates each dataset if it doesn't exist, or appends to an existing one. To start clean, delete the dataset in the Opik UI before re-seeding.
+
+## Step 3 — Run the eval
+
+```bash
+python evals/run_offline_eval.py
+```
+
+By default this runs against `scout-triage-inputs`. Set `SCOUT_GITHUB_DATASET_NAME` in `.env` to target a different dataset.
+
+Results and traces are logged to Opik under the project set in `SCOUT_EVAL_OPIK_PROJECT`. Each run gets a unique timestamped experiment name so results are easy to compare across runs.
+
+The task returns `input` (the target issue number and the scenario's issues), `output` (the comment Scout posted), `final_labels`, `applied_labels`, and `search_queries`, so scoring metrics can reference side-effect state, not just the comment text.
diff --git a/evals/run_offline_eval.py b/evals/run_offline_eval.py
@@ -0,0 +1,134 @@
+#!/usr/bin/env python3
+"""Run Scout offline against an Opik dataset of triage scenarios.
+
+Each dataset item must use the simulator format:
+    {
+      "scenario": "default",          # builder name from providers/scenarios.py
+      "spec": {
+        "owner": "...", "name": "...",
+        "readme": "...",               # optional
+        "files": {"src/foo.py": "..."}, # omit for real-GitHub file access
+        "issues": [...]
+      },
+      "target_issue": 42
+    }
+
+Env vars (on top of the normal Scout config in .env):
+    SCOUT_GITHUB_DATASET_NAME  — Opik dataset to evaluate (default: "scout-triage-inputs")
+    SCOUT_EVAL_OPIK_PROJECT    — Opik project for traces (default: "scout-eval")
+    SCOUT_EXPERIMENT_NAME      — prefix for the experiment name (default: "scout-offline-eval")
+"""
+from __future__ import annotations
+
+import logging
+import os
+import sys
+# scout.triage calls _get_issue_number() at module level; give it a dummy value so
+# the import succeeds — the eval never uses ISSUE_NUMBER from triage directly.
+os.environ.setdefault("ISSUE_NUMBER", "1")
+
+# Resolve the repo root so relative imports work when run from the evals/ dir.
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+
+from datetime import datetime
+
+import opik
+from opik.evaluation.metrics import Usefulness
+from dotenv import load_dotenv
+
+load_dotenv(override=True)
+
+from scout.agent import make_client, run_agent  # noqa: E402
+from scout.providers.scenarios import build  # noqa: E402
+from scout.triage import (  # noqa: E402
+    ANTHROPIC_API_KEY,
+    MAX_TOKENS,
+    MODEL,
+    SCOUT_ESCALATION_TAG,
+    SCOUT_OPIK_PROMPT_NAME,
+    load_system_prompt,
+)
+
+logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
+logger = logging.getLogger(__name__)
+
+DATASET_NAME = os.environ.get("SCOUT_GITHUB_DATASET_NAME", "scout-triage-inputs")
+EVAL_OPIK_PROJECT = os.environ.get("SCOUT_EVAL_OPIK_PROJECT", "scout-eval")
+EXPERIMENT_NAME_PREFIX = os.environ.get("SCOUT_EXPERIMENT_NAME", "scout-offline-eval")
+
+
+def _experiment_name() -> str:
+    return f"{EXPERIMENT_NAME_PREFIX}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
+
+
+def make_eval_task(system_prompt: str):
+    client = make_client(ANTHROPIC_API_KEY, opik_project=EVAL_OPIK_PROJECT)
+
+    def eval_task(item: dict) -> dict:
+        data = item.get("data", item)
+        if "spec" not in data or "target_issue" not in data:
+            keys = list(data.keys())
+            raise ValueError(
+                f"Dataset item is missing 'spec' or 'target_issue'. Got keys: {keys}. "
+                "Re-seed the dataset using evals/utils/seed_dataset.py — "
+                "old CSV-format rows must be removed first (delete the dataset in the Opik UI)."
+            )
+        scenario = data.get("scenario", "default")
+        spec = data["spec"]
+        target = int(data["target_issue"])
+
+        sim = build(scenario, spec)
+        logger.info("scenario=%s target=#%d", scenario, target)
+
+        comment, _trace_id = run_agent(
+            sim,
+            target,
+            client=client,
+            system_prompt=system_prompt,
+            escalation_tag=SCOUT_ESCALATION_TAG,
+            repo_owner=sim.owner,
+            repo_name=sim.name,
+            opik_project=EVAL_OPIK_PROJECT,
+            model=MODEL,
+            max_tokens=MAX_TOKENS,
+        )
+
+        final_issue = sim.issue(target)
+        applied = [c[2] for c in sim.calls if len(c) == 3 and c[0] == "apply_label"]
+        searches = [c[1] for c in sim.calls if len(c) == 2 and c[0] == "search_issues"]
+
+        return {
+            "input": {"target_issue": target, "issues": spec.get("issues", [])},
+            "output": comment,
+            "final_labels": final_issue["labels"],
+            "applied_labels": applied,
+            "search_queries": searches,
+        }
+
+    return eval_task
+
+
+def main() -> None:
+    opik_client = opik.Opik()
+    dataset = opik_client.get_dataset(DATASET_NAME)
+    system_prompt = load_system_prompt()
+    prompt_obj = opik_client.get_chat_prompt(name=SCOUT_OPIK_PROMPT_NAME) if SCOUT_OPIK_PROMPT_NAME else None
+
+    experiment_name = _experiment_name()
+    logger.info("Dataset: %s  |  Experiment: %s", DATASET_NAME, experiment_name)
+
+    opik.evaluate(
+        dataset=dataset,
+        task=make_eval_task(system_prompt),
+        experiment_name=experiment_name,
+        project_name=EVAL_OPIK_PROJECT,
+        experiment_config={"model": MODEL, "max_tokens": MAX_TOKENS, "prompt_name": SCOUT_OPIK_PROMPT_NAME},
+        prompts=[prompt_obj] if prompt_obj else [],
+        scoring_metrics=[Usefulness(model="claude-sonnet-4-6")],
+    )
+
+    logger.info("Done — view results in the Opik UI under project '%s'", EVAL_OPIK_PROJECT)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/evals/utils/fetch_github_issues.py b/evals/utils/fetch_github_issues.py
diff --git a/evals/utils/seed_dataset.py b/evals/utils/seed_dataset.py