Context
The relay repo has built a comprehensive eval system across 9 harnesses (claude, codex, opencode, gemini, grok, cursor, droid) and 600+ runs, covering the agent spawn/release lifecycle, phrasing sensitivity, and multi-model tier comparison. This issue tracks bringing the same rigour to pear, specifically for testing how agents use the `.integrations` directory.
Two scenario files already exist in `pear/tests/evals/scenarios/` as a starting point:
- `i01-integrations-discovery.ts` — Linear issue read + writeback path correctness
- `i02-integrations-event-reaction.ts` — Slack integration-event reaction + writeback
What the relay eval system looks like (reference)
Infrastructure
The eval runner lives in `relay/tests/integration/broker/evals/`:
evals/
  runner.ts                          # CLI: --harness=claude:sonnet,codex --group=lifecycle --repeat=5
  types.ts                           # EvalScenario, ScenarioResult, ScenarioContext
  scenarios/
    s01-spawn-worker.ts              # can agent spawn a relay worker?
    s02-release-worker.ts            # can agent release a relay worker?
    s03-spawn-release-lifecycle.ts   # full spawn → DONE → release cycle
    s04-no-native-subagents.ts       # does agent avoid native Task tool?
    s05-phrasing-variants.ts         # vocabulary sensitivity test
  scoring/
    base.ts                          # phantom detection, transcript capture
    lifecycle.ts                     # scoreSpawn(), scoreRelease()
  report/
    html.ts                          # generates evals-reports/*.html
Run via:
npm run eval:lifecycle -- --harness=claude:sonnet,codex,opencode:deepseek-v4-flash --repeat=5
What makes evals comprehensive
1. Multiple harnesses, same scenario. Every scenario runs against each harness (CLI) under test. The runner accepts `--harness=cli:model,cli2:model2`. Results are compared in a matrix report.
2. Majority-vote reliability. `--repeat=5` runs each scenario 5× per harness. A scenario PASS requires more than 50% of the individual runs to pass, which filters noise from flaky models.
3. Phantom detection. The scorer checks for "intent without send": an agent that says "I'll write the comment" in plain text instead of actually creating the file. These are caught and counted separately from real failures.
4. Onboarding variants. Each scenario runs against 4 onboarding styles:
   - `bare`: no guidance at all (baseline failure rate)
   - `one-liner`: single sentence hint
   - `brief`: compact but complete tool reference
   - `skill`: full reference with examples
5. HTML reports. Each run produces a per-harness HTML report with a full transcript, scenario matrix, and key metrics. Reports are stored in `evals-reports/` (per-run reports are gitignored, but the master summary is force-tracked).
6. Master summary. `eval-master-summary.html` aggregates all findings across harness types, codex model tiers, and opencode alternative models. It's the source of truth for "which model is best for which role".
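As a rough illustration of points 1 to 3, the sketch below parses a `--harness` value, classifies a single run, and applies the majority-vote rule. The names `parseHarnessFlag`, `classifyRun`, and `summarize`, the `RunOutcome` type, and the intent regex are all hypothetical and are not taken from the relay runner; the real scorer may detect phantoms quite differently.

```typescript
// Hypothetical sketch; none of these names come from the relay runner.
type RunOutcome = "pass" | "fail" | "phantom";

interface HarnessSpec {
  cli: string;
  model?: string; // omitted when the flag gives only a CLI name, e.g. "codex"
}

// Split a "cli:model,cli2:model2" flag value into structured specs.
function parseHarnessFlag(value: string): HarnessSpec[] {
  return value
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry): HarnessSpec => {
      const sep = entry.indexOf(":");
      return sep === -1
        ? { cli: entry }
        : { cli: entry.slice(0, sep), model: entry.slice(sep + 1) };
    });
}

// Assumed heuristic: a run with no file on disk but a stated intent to
// write one is a phantom; any other run without a file is a real failure.
function classifyRun(fileCreated: boolean, transcript: string): RunOutcome {
  if (fileCreated) return "pass";
  const statedIntent = /\bI(?:'ll| will)\s+(?:write|create|post|add)\b/i;
  return statedIntent.test(transcript) ? "phantom" : "fail";
}

// Majority vote: the scenario passes only if strictly more than half of
// the repeats passed. Phantoms are tallied separately from failures.
function summarize(outcomes: RunOutcome[]) {
  const passes = outcomes.filter((o) => o === "pass").length;
  const phantoms = outcomes.filter((o) => o === "phantom").length;
  const failures = outcomes.length - passes - phantoms;
  return { passes, phantoms, failures, passed: passes * 2 > outcomes.length };
}
```

Note that with an even `--repeat` value a 50/50 split counts as a failure under this reading of ">50%".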
Key findings from relay evals (relevant to pear)
| Harness | .integrations viability | Notes |
|---|---|---|
| claude:sonnet | ✅ Best | 100% s03, best structured FS writes |
| codex:gpt-5.5 | ✅ Excellent | relay-native, 0 phantoms |
| opencode:deepseek-v4-flash | ✅ Excellent | 16/16, 0 phantoms |
| opencode:qwen3.6-plus | ✅ Excellent | 16/16, 0 phantoms |
| opencode:minimax-m2.5 | ✅ Excellent | 16/16, 0 phantoms |
| gemini (native) | ⚠️ Limited | bare=60%; use opencode:gemini instead |
| grok (native) | ❌ Not viable | 0% all variants; use opencode:grok-build-0.1 |
| cursor-agent | ❌ Not viable | ignores relay MCP tools entirely |
What to build for pear
Scenarios to add
i01 and i02 are already written (see `tests/evals/scenarios/`). Add i03–i06:
| ID | Scenario | Tests |
|---|---|---|
| i01 | Linear .integrations read + writeback | Correct comment path vs discovery confusion |
| i02 | Slack integration-event reaction | Writeback vs direct API call |
| i03 | Discovery schema confusion | Agent must read schema from `discovery/` but NOT write there |
| i04 | Multi-provider (Linear + Slack) | Agent handles two connected integrations without cross-contamination |
| i05 | Stale by-state index | `by-state/` index is stale; agent must read live issue file to confirm state |
| i06 | Writeback path construction | Correct nested path: `issues/<filename>/comments/<ts>.json` not `issues/comments/<ts>.json` |
Onboarding variants
Exactly as in relay — test each scenario against bare, one-liner, brief, and skill variants of the <integrations-update> system message. Measure the minimum context that achieves ≥80% reliability.
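One way to read "minimum context that achieves ≥80% reliability" is to walk the variants from least to most guidance and stop at the first one whose pass rate clears the threshold. The sketch below assumes pass rates are already available as fractions per variant; `minimumReliableVariant` and the shape of its input are hypothetical, not part of the relay runner.

```typescript
// Variants ordered from least to most onboarding context.
const VARIANTS = ["bare", "one-liner", "brief", "skill"] as const;
type Variant = (typeof VARIANTS)[number];

// Returns the cheapest variant whose pass rate meets the threshold,
// or null if no variant is reliable enough. Pass rates are fractions
// in [0, 1], e.g. 4 passes out of 5 repeats is 0.8.
function minimumReliableVariant(
  passRates: Record<Variant, number>,
  threshold = 0.8,
): Variant | null {
  for (const variant of VARIANTS) {
    if (passRates[variant] >= threshold) return variant;
  }
  return null;
}
```

For example, pass rates of 0.4 (`bare`), 0.8 (`one-liner`), 1.0 (`brief`), and 1.0 (`skill`) would yield `one-liner`.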
Harnesses to test
Run all i01–i06 scenarios against (priority order):
1. `claude:sonnet`: expected baseline (best structured FS writes)
2. `claude:opus`: high-complexity multi-provider scenarios
3. `claude:haiku`: cheapest, but can it handle .integrations?
4. `codex:gpt-5.5`: relay-native, expect strong FS handling
5. `opencode:deepseek-v4-flash`: top Chinese model, relay-native
6. `opencode:qwen3.6-plus`: second best Chinese model
7. `gemini` (via opencode): `opencode:gemini-3.1-pro`
8. `droid`: used in pear today; needs .integrations validation
Report infrastructure
Follow the relay pattern:
pear/tests/evals/
  runner.ts                               # thin wrapper: import core scenarios + pear-specific
  scenarios/
    i01-integrations-discovery.ts         ← already written
    i02-integrations-event-reaction.ts    ← already written
    i03-discovery-schema-confusion.ts
    i04-multi-provider.ts
    i05-stale-index.ts
    i06-writeback-path-construction.ts
  evals-reports/                          # gitignored per-run HTML/JSON; force-track master summary
    eval-master-summary.html
The runner should output:
- Per-run JSON report: `report-<timestamp>-<harness>.json`
- Per-run HTML report: `report-<timestamp>-<harness>.html`
- Master summary: `eval-master-summary.html`, a hand-curated aggregate updated after each batch
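The issue does not specify the timestamp format or how a harness id such as `claude:sonnet` maps to a filename, so the following is only one possible convention. It assumes an ISO-8601 UTC timestamp with `:` and `.` replaced by `-`, and replaces any character outside `[A-Za-z0-9._-]` in the harness id with `_`. `reportBaseName` and `reportFileNames` are hypothetical helpers.

```typescript
// Hypothetical naming helpers; the timestamp format and the harness
// sanitisation rule are assumptions, not taken from the relay runner.
function reportBaseName(harness: string, when: Date): string {
  // toISOString() yields e.g. "2025-01-02T03:04:05.000Z"; colons are not
  // portable in filenames, so replace ":" and "." with "-".
  const timestamp = when.toISOString().replace(/[:.]/g, "-");
  const safeHarness = harness.replace(/[^A-Za-z0-9._-]/g, "_");
  return `report-${timestamp}-${safeHarness}`;
}

// Both per-run artefacts share one base name so they sort together.
function reportFileNames(harness: string, when: Date) {
  const base = reportBaseName(harness, when);
  return { json: `${base}.json`, html: `${base}.html` };
}
```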
Implementation notes
Fixture setup pattern (from i01/i02): each scenario creates a temp directory with a realistic .integrations/ tree, seeds it with mock data, spawns the agent pointing at that directory, and tears it down in finally. Scoring checks filesystem state (file created at correct path) rather than just broker events.
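A minimal sketch of that create, seed, run, tear-down shape, using only Node's standard library, might look like the following. It is not the code from i01/i02: `withIntegrationsFixture` is a hypothetical helper, the `linear/issues/ENG-1/issue.json` layout and its contents are invented mock data, and the `run` callback stands in for spawning the agent and scoring the resulting filesystem state.

```typescript
import * as fs from "node:fs/promises";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: creates a temp dir with a seeded .integrations/
// tree, hands its root to `run`, and always removes it afterwards.
async function withIntegrationsFixture<T>(
  run: (root: string) => Promise<T>,
): Promise<T> {
  const root = await fs.mkdtemp(path.join(os.tmpdir(), "pear-eval-"));
  try {
    // Assumed layout and mock data; the real fixtures may differ.
    const issueDir = path.join(root, ".integrations", "linear", "issues", "ENG-1");
    await fs.mkdir(issueDir, { recursive: true });
    await fs.writeFile(
      path.join(issueDir, "issue.json"),
      JSON.stringify({ id: "ENG-1", state: "todo" }),
    );
    // `run` would spawn the agent against `root` and score what it wrote.
    return await run(root);
  } finally {
    // Tear down even if the agent run or the scoring throws.
    await fs.rm(root, { recursive: true, force: true });
  }
}
```

Scoring inside the callback, before the `finally` runs, keeps the filesystem checks from racing the cleanup.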
Scoring dimensions specific to pear evals:
- `writebackFound`: did agent create the file?
- `writebackPath`: is the path exactly correct (correct nested dir, correct filename)?
- `writtenToDiscovery`: violation: agent wrote to schema-only directory
- `calledDirectApi`: violation: agent referenced external API instead of .integrations
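These four dimensions could be computed from two inputs: the list of files the agent created (relative to `.integrations/`) and the run transcript. The sketch below is a hedged illustration, not the pear scorer. `scoreWriteback` and `PearScore` are hypothetical names, the "any JSON file under a `comments/` directory" test for `writebackFound` is an assumption, and the transcript check for `api.linear.app` or `slack.com/api` is a crude stand-in for real direct-API detection.

```typescript
interface PearScore {
  writebackFound: boolean;
  writebackPath: boolean;
  writtenToDiscovery: boolean;
  calledDirectApi: boolean;
}

// Hypothetical scorer. `createdPaths` are files the agent created,
// relative to .integrations/; `expectedPath` is the exact correct target.
function scoreWriteback(
  createdPaths: string[],
  expectedPath: string,
  transcript: string,
): PearScore {
  // Normalise Windows separators so comparisons are path-style agnostic.
  const normalized = createdPaths.map((p) => p.replace(/\\/g, "/"));
  // Assumption: any JSON file directly under a comments/ dir is a writeback attempt.
  const commentWrites = normalized.filter((p) =>
    /(^|\/)comments\/[^/]+\.json$/.test(p),
  );
  return {
    writebackFound: commentWrites.length > 0,
    writebackPath: normalized.includes(expectedPath),
    // Violation: anything created under a discovery/ directory.
    writtenToDiscovery: normalized.some((p) => p.split("/").includes("discovery")),
    // Assumed heuristic: the transcript mentions a provider API host.
    calledDirectApi: /api\.linear\.app|slack\.com\/api/i.test(transcript),
  };
}
```

Keeping `writebackFound` and `writebackPath` separate distinguishes an agent that wrote to `issues/comments/<ts>.json` (found, wrong path) from one that wrote nothing at all.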
Dependency: the runner needs @agent-relay/evals (currently being extracted from relay in relay/packages/evals/). Until that package is published, import types from relay as a local path or inline the minimal type definitions.
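If the types are inlined, a placeholder along these lines may be enough to get the new scenario files compiling. Only the three type names (`EvalScenario`, `ScenarioResult`, `ScenarioContext`) come from relay's `types.ts`; every field below is a guess and should be replaced with the real definitions once `@agent-relay/evals` is importable.

```typescript
// Placeholder shapes only: the field names are assumptions, not the
// actual definitions from relay's types.ts.
interface ScenarioContext {
  harness: string; // e.g. "claude:sonnet"
  workDir: string; // directory containing the seeded .integrations/ tree
}

interface ScenarioResult {
  scenarioId: string;
  passed: boolean;
  notes: string[];
}

interface EvalScenario {
  id: string;
  description: string;
  run(ctx: ScenarioContext): Promise<ScenarioResult>;
}

// Stub scenario showing how a not-yet-written file such as i03 could
// satisfy the interface until the real logic lands.
const i03Placeholder: EvalScenario = {
  id: "i03",
  description: "Discovery schema confusion",
  async run(ctx) {
    return {
      scenarioId: "i03",
      passed: false,
      notes: [`not implemented yet for ${ctx.harness}`],
    };
  },
};
```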
Success criteria
- All i01–i06 scenarios pass at ≥80% for `claude:sonnet` with `one-liner` onboarding
- `codex:gpt-5.5` and `opencode:deepseek-v4-flash` pass i01–i04 at ≥80% with `bare` onboarding
- Clear verdict on whether `droid` and `claude:haiku` are viable for .integrations tasks
- Master summary HTML updated with pear-specific findings, mirroring the relay master summary format
- pear AGENTS.md updated with model/harness recommendations for .integrations workflows
Related
- relay#feature/combined-evals — where the eval runner lives
- relay/packages/evals (the `@agent-relay/evals` package being extracted)
- relay/specs/agent-relay-evals-package.md — migration plan
- relay/tests/integration/broker/evals/runner.ts — reference runner implementation
- relay/tests/integration/broker/evals-reports/eval-master-summary.html — reference master summary