refactor: restructure into a multi-action package (#26)

Move Scout's flat top-level scripts into an installable `scout` package
(src/scout/) and split the distributable GitHub Actions so the repo can
host more than one. Each action is now a thin composite wrapper that
pip-installs the package and runs a console-script entry point — no
checkout/requirements coupling in consumer workflows.

- src/scout/: triage.py, feedback.py, agent.py, init.py, providers/
- evals/: run_eval.py (was scout_eval.py) + seeders/scenarios (dev-only)
- tests/: test_triage / test_feedback / test_init
- pyproject.toml: build-system, [project.scripts]
  (scout-triage / scout-feedback / scout-init), dev extras, pytest pythonpath
- action.yml: triage wrapper (root path preserved for existing consumers)
- actions/feedback/action.yml: new feedback-sync action
- remove scout-feedback.yml (belongs in the repo that runs Scout) and
  requirements*.txt (consolidated into pyproject)
- update READMEs, CI (pip install -e .[dev], mypy src/scout), .gitignore

Co-authored-by: Douglas Blank <doug@comet.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
README-OPIK-INTEGRATION.md: 21 additions & 21 deletions
@@ -15,26 +15,26 @@ The fix: an in-memory simulator that behaves like GitHub at the seams Scout actu
 ## Architecture

 ```
-    scout.py                  scout_eval.py
+  scout.triage              evals/run_eval.py
         │                           │
         ▼                           ▼
  GitHubProvider              GitHubSimulator  ◄── implements ──┐
    (PyGithub)                  (in-memory)                     │
                                                                │
                                                       RepositoryProvider
-                                                      (providers/base.py)
+                                                   (scout/providers/base.py)
                                                                ▲
                                                                │
                                                         agent.run_agent
-                                                          (agent.py)
+                                                       (scout/agent.py)
 ```

 `agent.run_agent` is the single agent loop. It depends only on the `RepositoryProvider` protocol — not on PyGithub, not on any module globals. Both backends satisfy the same interface; swapping them changes nothing about the loop, the prompt, the model, or the tool dispatch.
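A minimal sketch of the seam this paragraph describes, not the code in `scout/providers/base.py`. The method names `get_issue_title` and `add_label`, the `InMemoryProvider` class, and the one-step `run_agent_sketch` body are hypothetical stand-ins. It only illustrates that a loop typed against a `Protocol` runs unchanged against any backend that supplies the same methods.

```python
from typing import Protocol


class RepositoryProvider(Protocol):
    # Hypothetical two-method surface; the real protocol is larger.
    def get_issue_title(self, number: int) -> str: ...
    def add_label(self, number: int, label: str) -> None: ...


class InMemoryProvider:
    """Stand-in backend that keeps issue titles and labels in plain dicts."""

    def __init__(self, titles: dict[int, str]) -> None:
        self.titles = titles
        self.labels: dict[int, list[str]] = {}

    def get_issue_title(self, number: int) -> str:
        return self.titles[number]

    def add_label(self, number: int, label: str) -> None:
        self.labels.setdefault(number, []).append(label)


def run_agent_sketch(provider: RepositoryProvider, number: int) -> str:
    # Placeholder for the model and tool loop: one read and one write,
    # both made through the protocol rather than a concrete backend.
    title = provider.get_issue_title(number)
    label = "bug" if "crash" in title.lower() else "question"
    provider.add_label(number, label)
    return label
```

Because `InMemoryProvider` never inherits from `RepositoryProvider`, the match is purely structural, which is what lets a PyGithub-backed class and a simulator share one loop.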

 The protocol covers everything Scout's tools (and `main()`) need from "GitHub":
@@ -52,7 +52,7 @@ class RepositoryProvider(Protocol):

 ## The simulator

-`providers/simulator.py` defines `GitHubSimulator` — a real object with mutable state, not a passive fixture. Scenarios build it up with a fluent API; the agent reads and writes against it; assertions inspect both the output text and the resulting state.
+`src/scout/providers/simulator.py` defines `GitHubSimulator` — a real object with mutable state, not a passive fixture. Scenarios build it up with a fluent API; the agent reads and writes against it; assertions inspect both the output text and the resulting state.

 ```python
 sim = (
@@ -113,7 +113,7 @@ Example real-GitHub spec:

 ## Scenarios — bridging JSON to the simulator

-Opik dataset rows are JSON; simulator behavior is Python. `providers/scenarios.py` reconciles them with a small registry:
+Opik dataset rows are JSON; simulator behavior is Python. `src/scout/providers/scenarios.py` reconciles them with a small registry:
@@ -203,7 +203,7 @@ See `evals/starter_scenarios.py` for five worked examples covering duplicate cit

 ## How the eval driver works

-`scout_eval.py` is the bridge between Opik's `run_tests` and the agent loop:
+`evals/run_eval.py` is the bridge between Opik's `run_tests` and the agent loop:

 ```python
 def task(item: dict) -> dict:
@@ -251,8 +251,8 @@ Assertions can reference any of these. Examples:

 Both providers route traces through Opik identically — tracing lives at the agent/tool/LLM layer, not the provider layer. What differs is the project name:

-- Production runs (`scout.py` → `GitHubProvider`) trace to `scout:<owner>/<repo>`.
-- Eval runs (`scout_eval.py` → `GitHubSimulator`) trace to `scout-eval` (override with `SCOUT_EVAL_OPIK_PROJECT`).
+- Production runs (`scout.triage` → `GitHubProvider`) trace to `scout:<owner>/<repo>`.
+- Eval runs (`evals/run_eval.py` → `GitHubSimulator`) trace to `scout-eval` (override with `SCOUT_EVAL_OPIK_PROJECT`).

 Different projects keep prod triage and eval experiments visually separate in the Opik UI. Eval runs are noisy — you may run a 5-item suite many times while iterating on the prompt — and you don't want that drowning out real triage traces.
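The naming rule above can be sketched as a small function. The project names and the `SCOUT_EVAL_OPIK_PROJECT` variable come from the text; the function `opik_project`, its `mode` argument, and the injectable `env` mapping are hypothetical and are not taken from the Scout source.

```python
import os
from typing import Mapping


def opik_project(mode: str, owner: str = "", repo: str = "",
                 env: Mapping[str, str] = os.environ) -> str:
    """Pick the Opik project name for a run (illustrative only)."""
    if mode == "eval":
        # Eval runs share one project unless the env var overrides it.
        return env.get("SCOUT_EVAL_OPIK_PROJECT", "scout-eval")
    # Production runs get one project per repository.
    return f"scout:{owner}/{repo}"
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.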
-`GITHUB_TOKEN` must be set because `scout.py` validates it at import time. For all-simulated suites the value is unused — `unused` is fine. **For suites that include real-GitHub-mode scenarios** (specs with no `files` key), it must be a real token with read access to the target repo. `SCOUT_EXPERIMENT_NAME` is treated as a *prefix*: each run gets `{prefix}-YYYY-MM-DD-HH-MM-SS` appended, so re-running without changing the env var produces a fresh, chronologically sortable experiment in the Opik UI.
+`GITHUB_TOKEN` must be set because the triage module validates it at import time. For all-simulated suites the value is unused — `unused` is fine. **For suites that include real-GitHub-mode scenarios** (specs with no `files` key), it must be a real token with read access to the target repo. `SCOUT_EXPERIMENT_NAME` is treated as a *prefix*: each run gets `{prefix}-YYYY-MM-DD-HH-MM-SS` appended, so re-running without changing the env var produces a fresh, chronologically sortable experiment in the Opik UI.
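The prefix rule can be illustrated in a few lines. This is a sketch under assumptions: the helper name `experiment_name` is hypothetical, and the text does not say whether the driver uses local time or UTC, so the timestamp is passed in by the caller.

```python
from datetime import datetime


def experiment_name(prefix: str, now: datetime) -> str:
    """Append a YYYY-MM-DD-HH-MM-SS stamp to the configured prefix."""
    stamp = now.strftime("%Y-%m-%d-%H-%M-%S")
    return f"{prefix}-{stamp}"
```

Because every field is zero-padded and ordered from year down to second, names that share a prefix sort chronologically as plain strings.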

 ## Adding a scenario

 1. **Open `evals/starter_scenarios.py`** and append a new item to `STARTER_SCENARIOS` following the shape above.
 2. **Write 3–6 specific assertions** that a judge can answer yes/no clearly. Reference the surfaced output keys (`output`, `final_labels`, `applied_labels`, `search_queries`) when behavior matters more than text.
-3. **If your scenario needs programmable behavior** (flaky tools, multi-call state changes), register a new builder with `@register("your-name")` in `providers/scenarios.py` and reference it via `"scenario": "your-name"`. The base `_default` builder is composable — call it inside your builder and then mutate the result.
-4. **Run `pytest test_scout.py -k Starter`** — three parametrized validation tests will check your scenario is structurally valid before you push.
+3. **If your scenario needs programmable behavior** (flaky tools, multi-call state changes), register a new builder with `@register("your-name")` in `src/scout/providers/scenarios.py` and reference it via `"scenario": "your-name"`. The base `_default` builder is composable — call it inside your builder and then mutate the result.
+4. **Run `pytest tests/test_triage.py -k Starter`** — three parametrized validation tests will check your scenario is structurally valid before you push.
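Step 3 describes a name-to-builder registry. One way such a registry might look is sketched below. Only `register`, the `"scenario"` key, and the `_default` builder name come from the text. The `SCENARIOS` dict, the `build` helper, the builder signature, the `"prelabeled"` example, and the plain dict standing in for the simulator object are all assumptions.

```python
from typing import Any, Callable

Builder = Callable[[dict[str, Any]], dict[str, Any]]
SCENARIOS: dict[str, Builder] = {}  # hypothetical registry: name -> builder


def register(name: str) -> Callable[[Builder], Builder]:
    """Decorator that files a builder function under `name`."""
    def decorator(fn: Builder) -> Builder:
        SCENARIOS[name] = fn
        return fn
    return decorator


@register("_default")
def _default(spec: dict[str, Any]) -> dict[str, Any]:
    # A plain dict stands in for the simulator object in this sketch.
    return {"files": dict(spec.get("files", {})), "labels": []}


@register("prelabeled")
def prelabeled(spec: dict[str, Any]) -> dict[str, Any]:
    # Compose: start from the default build, then mutate the result.
    sim = _default(spec)
    sim["labels"].append("needs-triage")
    return sim


def build(spec: dict[str, Any]) -> dict[str, Any]:
    # The JSON row selects its Python builder by name.
    return SCENARIOS[spec.get("scenario", "_default")](spec)
```

The dataset row stays pure JSON while anything programmable lives behind the name it references.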
README.md: 48 additions & 4 deletions
@@ -209,9 +209,47 @@ Anyone viewing an issue can rate Scout's triage comment by adding a 👍 or 👎
 - **1.0** = all 👍, **0.0** = all 👎, otherwise the ratio `👍 / (👍 + 👎)` (e.g. 3 👍 and 1 👎 → `0.75`). Reactions other than 👍/👎 are ignored.
 - The score carries a `reason` that attributes the votes by GitHub login, e.g. `👍 2 (alice, bob) / 👎 1 (carol) from GitHub`, so you can see *who* reacted in the Opik UI alongside the trace.
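The scoring rule in the two bullets above can be sketched as follows. The function name `reaction_score` is hypothetical. Returning `None` when there are no votes, and omitting an empty side from the reason string, are assumptions, since the text does not specify either case.

```python
from typing import Optional


def reaction_score(up: list[str], down: list[str]) -> Optional[tuple[float, str]]:
    """Return (score, reason) from the logins behind each reaction."""
    total = len(up) + len(down)
    if total == 0:
        return None  # assumption: with no votes there is nothing to score
    score = len(up) / total  # 1.0 = all thumbs up, 0.0 = all thumbs down
    parts = []
    if up:
        parts.append(f"👍 {len(up)} ({', '.join(up)})")
    if down:
        parts.append(f"👎 {len(down)} ({', '.join(down)})")
    return score, " / ".join(parts) + " from GitHub"
```

For example, `reaction_score(["alice", "bob"], ["carol"])` produces the reason string quoted in the bullet above.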

-**How it works.** GitHub fires no event when a reaction is added, so a scheduled workflow (`.github/workflows/scout-feedback.yml`, every 30 min) polls recent issues, reads the reaction counts on Scout's comments, and upserts the score. Each Scout comment carries a hidden marker (`<!-- scout-feedback trace_id=… -->`) that maps it back to its Opik trace. The sync is idempotent — re-running simply recomputes the score from current reactions — so feedback lands in Opik within one cron interval and self-corrects as votes change.
+**How it works.** GitHub fires no event when a reaction is added, so a scheduled workflow polls recent issues every 30 minutes, reads the reaction counts on Scout's comments, and upserts the score. Each Scout comment carries a hidden marker (`<!-- scout-feedback trace_id=… -->`) that maps it back to its Opik trace. The sync is idempotent — re-running simply recomputes the score from current reactions — so feedback lands in Opik within one cron interval and self-corrects as votes change.
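A sketch of how the hidden marker might be read back out of a comment body. The marker prefix comes from the text. The regular expression, the helper name `trace_id_from_comment`, and the assumption that a trace id is a single whitespace-free token are illustrative guesses, not Scout's parser.

```python
import re
from typing import Optional

# Assumes the id is one run of non-whitespace characters after `trace_id=`.
MARKER = re.compile(r"<!--\s*scout-feedback\s+trace_id=(\S+)\s*-->")


def trace_id_from_comment(body: str) -> Optional[str]:
    """Return the trace id embedded in a comment, or None if it has no marker."""
    match = MARKER.search(body)
    return match.group(1) if match else None
```

Comments without the marker return `None`, so a sync pass can skip anything Scout did not write.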

-**Enabling it.** Add `.github/workflows/scout-feedback.yml` to the repo that runs Scout (it reuses the same `SCOUT_OPIK_API_KEY` secret and `OPIK_WORKSPACE` / `SCOUT_GITHUB_REPO_OWNER` / `SCOUT_GITHUB_REPO_NAME` variables). Trigger it manually from the Actions tab for an immediate sync.
+**Enabling it.** The feedback sync ships as a second action published from this repo, `comet-ml/scout-repo-agent/actions/feedback`, alongside the triage action. Add a scheduled workflow to the repo that runs Scout — it reuses the same `OPIK_API_KEY` secret and `OPIK_WORKSPACE` value as the triage action:
+
+```yaml
+name: Scout Feedback Sync
+
+on:
+  schedule:
+    - cron: '*/30 * * * *' # every 30 minutes
+  workflow_dispatch:
+    inputs:
+      since_days:
+        description: How many days back to scan issues for reactions
+Trigger it manually from the Actions tab for an immediate sync.

 > **Scan window.** GitHub does not bump an issue's `updated_at` when a reaction is added, so the sync only re-checks issues with other activity within `SCOUT_FEEDBACK_SINCE_DAYS` (default 7). Reactions on otherwise-quiet older issues may be missed — run the workflow manually with a larger `since_days` to backfill. Because the upsert is idempotent, re-syncing is always safe.
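The scan-window behavior can be illustrated with a small filter. This is a sketch only: the helper `issues_to_rescan` and the dict shape with `number` and `updated_at` keys are hypothetical, while the 7-day default mirrors `SCOUT_FEEDBACK_SINCE_DAYS` as described above.

```python
from datetime import datetime, timedelta, timezone


def issues_to_rescan(issues: list[dict], now: datetime, since_days: int = 7) -> list[int]:
    """Return the numbers of issues whose last activity falls inside the window."""
    cutoff = now - timedelta(days=since_days)
    return [issue["number"] for issue in issues if issue["updated_at"] >= cutoff]


now = datetime(2025, 6, 30, tzinfo=timezone.utc)
issues = [
    {"number": 1, "updated_at": now - timedelta(days=2)},   # recently active
    {"number": 2, "updated_at": now - timedelta(days=20)},  # quiet, outside 7 days
]
default_window = issues_to_rescan(issues, now)
backfill_window = issues_to_rescan(issues, now, since_days=30)
```

With the default window only issue 1 is re-checked. Widening `since_days` to 30 picks up issue 2 as well, which is the backfill case the note describes.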
@@ -221,15 +259,21 @@ Use the manual trigger workflow in this repo's Actions tab (`Test Scout (Manual)

 ## Local development

+The code is an installable package under `src/scout/`. Install it (with dev extras) in editable mode:
+
 ```bash
-pip install -r requirements.txt
+pip install -e ".[dev]"

 # Copy and fill in the template
 cp .env.example .env

-python scout.py
+# Run triage locally (console script registered by the install).
+# Equivalent to `python -m scout.triage`.
+scout-triage
 ```

+Run the unit tests and linters with `pytest`, `ruff check .`, and `mypy src/scout`.