
Comprehensive harness evals for pear — .integrations, agent lifecycle, and multi-model coverage #269

@khaliqgant

Context

The relay repo has built a comprehensive eval system across 9 harnesses (including claude, codex, opencode, gemini, grok, cursor, and droid) and more than 600 runs, covering agent spawn/release lifecycle, phrasing sensitivity, and multi-model tier comparison. This issue tracks bringing the same rigour to pear, specifically for testing how agents use the `.integrations` directory.

Two scenario files already exist in `pear/tests/evals/scenarios/` as a starting point:

  • `i01-integrations-discovery.ts` — Linear issue read + writeback path correctness
  • `i02-integrations-event-reaction.ts` — Slack integration-event reaction + writeback

What the relay eval system looks like (reference)

Infrastructure

The eval runner lives in `relay/tests/integration/broker/evals/`:

evals/
  runner.ts          # CLI: --harness=claude:sonnet,codex --group=lifecycle --repeat=5
  types.ts           # EvalScenario, ScenarioResult, ScenarioContext
  scenarios/
    s01-spawn-worker.ts         # can agent spawn a relay worker?
    s02-release-worker.ts       # can agent release a relay worker?
    s03-spawn-release-lifecycle.ts  # full spawn → DONE → release cycle
    s04-no-native-subagents.ts  # does agent avoid native Task tool?
    s05-phrasing-variants.ts    # vocabulary sensitivity test
  scoring/
    base.ts           # phantom detection, transcript capture
    lifecycle.ts      # scoreSpawn(), scoreRelease()
  report/
    html.ts           # generates evals-reports/*.html

Run via:

npm run eval:lifecycle -- --harness=claude:sonnet,codex,opencode:deepseek-v4-flash --repeat=5

What makes evals comprehensive

1. Multiple harnesses, same scenario. Every scenario runs against each harness (CLI) under test. The runner accepts --harness=cli:model,cli2:model2. Results are compared in a matrix report.

2. Majority-vote reliability. --repeat=5 runs each scenario 5× per harness. A scenario PASS requires more than 50% of the individual runs to pass, which filters out noise from flaky models.

3. Phantom detection. The scorer checks for "intent without send" — an agent that says "I'll write the comment" in plain text instead of actually creating the file. These are caught and counted separately from real failures.

4. Onboarding variants. Each scenario runs against 4 onboarding styles:

  • bare — no guidance at all (baseline failure rate)
  • one-liner — single sentence hint
  • brief — compact but complete tool reference
  • skill — full reference with examples

5. HTML reports. Each run produces a per-harness HTML report with a full transcript, scenario matrix, and key metrics. Reports are stored in evals-reports/ (per-run reports are gitignored; the master summary is force-tracked).

6. Master summary. eval-master-summary.html aggregates all findings across harness types, codex model tiers, and opencode alternative models. It's the source of truth for "which model is best for which role".
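
Points 1 to 3 above can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not the relay runner's actual code: `parseHarnessFlag`, `majorityPass`, `classifyRun`, and the intent regex are hypothetical names and heuristics invented for this example.

```typescript
// Hypothetical sketch; names and heuristics are assumptions, not the relay runner's API.

interface HarnessSpec {
  cli: string;
  model?: string; // absent when the flag gives only a CLI name, e.g. "codex"
}

// Parse a value such as "claude:sonnet,codex,opencode:deepseek-v4-flash".
function parseHarnessFlag(value: string): HarnessSpec[] {
  return value
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry): HarnessSpec => {
      const sep = entry.indexOf(":");
      return sep === -1
        ? { cli: entry }
        : { cli: entry.slice(0, sep), model: entry.slice(sep + 1) };
    });
}

// Majority vote: strictly more than half of the repeated runs must pass.
function majorityPass(runs: boolean[]): boolean {
  const passed = runs.filter((ok) => ok).length;
  return passed * 2 > runs.length;
}

type RunOutcome = "pass" | "phantom" | "fail";

// Assumed heuristic for "intent without send": the transcript announces an
// action on one line, but no writeback file was found on disk.
const INTENT = /\b(i'll|i will|i'm going to|let me)\b.*\b(write|create|post|add|send)\b/i;

function classifyRun(transcript: string, writebackFound: boolean): RunOutcome {
  if (writebackFound) return "pass";
  return INTENT.test(transcript) ? "phantom" : "fail";
}
```

With `--repeat=5`, three passing runs are enough for a PASS under this rule, whereas an even split (for example 2 of 4) is not. A real scorer would likely also check the writeback path before declaring a pass; here "pass" only means a file was found.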

Key findings from relay evals (relevant to pear)

| Harness | .integrations viability | Notes |
|---|---|---|
| claude:sonnet | ✅ Best | 100% s03, best structured FS writes |
| codex:gpt-5.5 | ✅ Excellent | relay-native, 0 phantoms |
| opencode:deepseek-v4-flash | ✅ Excellent | 16/16, 0 phantoms |
| opencode:qwen3.6-plus | ✅ Excellent | 16/16, 0 phantoms |
| opencode:minimax-m2.5 | ✅ Excellent | 16/16, 0 phantoms |
| gemini (native) | ⚠️ Limited | bare=60%; use opencode:gemini instead |
| grok (native) | ❌ Not viable | 0% all variants; use opencode:grok-build-0.1 |
| cursor-agent | ❌ Not viable | ignores relay MCP tools entirely |

What to build for pear

Scenarios to add

i01 and i02 are already written (see tests/evals/scenarios/); i03–i06 still need to be added. The full set:

| ID | Scenario | Tests |
|---|---|---|
| i01 | Linear .integrations read + writeback | Correct comment path vs discovery confusion |
| i02 | Slack integration-event reaction | Writeback vs direct API call |
| i03 | Discovery schema confusion | Agent must read schema from discovery/ but NOT write there |
| i04 | Multi-provider (Linear + Slack) | Agent handles two connected integrations without cross-contamination |
| i05 | Stale by-state index | by-state/ index is stale; agent must read live issue file to confirm state |
| i06 | Writeback path construction | Correct nested path: issues/<filename>/comments/<ts>.json not issues/comments/<ts>.json |
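
To make the i06 distinction concrete, here is a small sketch of the nested path shape. `buildCommentPath` and `isNestedCommentPath` are hypothetical helpers, and the sample issue name `ENG-1` and the numeric timestamp are invented for illustration; the real filename and timestamp formats are whatever the .integrations tree actually uses.

```typescript
// Hypothetical helpers illustrating the i06 path shape; not part of pear.

// Build issues/<filename>/comments/<ts>.json from its two variable parts.
function buildCommentPath(issueFilename: string, ts: string | number): string {
  return ["issues", issueFilename, "comments", `${ts}.json`].join("/");
}

// True only for the nested form; the flat issues/comments/<ts>.json is rejected.
function isNestedCommentPath(p: string): boolean {
  return /^issues\/[^/]+\/comments\/[^/]+\.json$/.test(p);
}
```

A scorer could compare the path the agent wrote against the output of a builder like this, rather than only checking that some comment file exists.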

Onboarding variants

Exactly as in relay — test each scenario against bare, one-liner, brief, and skill variants of the <integrations-update> system message. Measure the minimum context that achieves ≥80% reliability.
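
One way to compute "minimum context" is to walk the variants from least to most guidance and stop at the first one that meets the threshold. The sketch below assumes pass rates are already aggregated per variant as fractions between 0 and 1; `minimumViableVariant` is a hypothetical name.

```typescript
// Hypothetical sketch: variants ordered from least to most guidance.
const VARIANTS = ["bare", "one-liner", "brief", "skill"] as const;
type OnboardingVariant = (typeof VARIANTS)[number];

// Return the least-guided variant whose pass rate meets the threshold,
// or null if none does.
function minimumViableVariant(
  passRates: Record<OnboardingVariant, number>,
  threshold = 0.8,
): OnboardingVariant | null {
  for (const variant of VARIANTS) {
    if (passRates[variant] >= threshold) return variant;
  }
  return null;
}
```

For example, pass rates of 0.4, 0.8, 1.0, and 1.0 (in the order above) would report `one-liner` as the minimum viable onboarding for that harness and scenario.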

Harnesses to test

Run all i01–i06 scenarios against (priority order):

  1. claude:sonnet — expected baseline (best structured FS writes)
  2. claude:opus — high-complexity multi-provider scenarios
  3. claude:haiku — cheapest, but can it handle .integrations?
  4. codex:gpt-5.5 — relay-native, expect strong FS handling
  5. opencode:deepseek-v4-flash — top Chinese model, relay-native
  6. opencode:qwen3.6-plus — second best Chinese model
  7. gemini (via opencode) — opencode:gemini-3.1-pro
  8. droid — used in pear today; needs .integrations validation

Report infrastructure

Follow the relay pattern:

pear/tests/evals/
  runner.ts            # thin wrapper: import core scenarios + pear-specific
  scenarios/
    i01-integrations-discovery.ts    ← already written
    i02-integrations-event-reaction.ts ← already written
    i03-discovery-schema-confusion.ts
    i04-multi-provider.ts
    i05-stale-index.ts
    i06-writeback-path-construction.ts
  evals-reports/       # gitignored per-run HTML/JSON; force-track master summary
    eval-master-summary.html

The runner should output:

  1. Per-run JSON report: report-<timestamp>-<harness>.json
  2. Per-run HTML report: report-<timestamp>-<harness>.html
  3. Master summary: eval-master-summary.html — a hand-curated aggregate updated after each batch

Implementation notes

Fixture setup pattern (from i01/i02): each scenario creates a temp directory with a realistic .integrations/ tree, seeds it with mock data, spawns the agent pointing at that directory, and tears the directory down in a finally block so cleanup runs even when the scenario fails. Scoring checks filesystem state (file created at the correct path) rather than just broker events.
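
The fixture pattern might look roughly like the sketch below. It is not the code from i01/i02: `seedIntegrations`, `withFixture`, and the `FixtureFiles` shape are hypothetical, and the agent spawn is represented by a caller-supplied async callback. Only Node's built-in `fs`, `os`, and `path` modules are used.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical shape: relative path under .integrations/ -> JSON-serialisable mock data.
type FixtureFiles = Record<string, unknown>;

// Create a temp directory and seed a .integrations/ tree inside it.
function seedIntegrations(files: FixtureFiles): string {
  const root = fs.mkdtempSync(path.join(os.tmpdir(), "pear-eval-"));
  for (const [relPath, content] of Object.entries(files)) {
    const target = path.join(root, ".integrations", relPath);
    fs.mkdirSync(path.dirname(target), { recursive: true });
    fs.writeFileSync(target, JSON.stringify(content, null, 2));
  }
  return root;
}

// Run a scenario against a seeded fixture and always remove it afterwards.
async function withFixture<T>(
  files: FixtureFiles,
  run: (root: string) => Promise<T>,
): Promise<T> {
  const root = seedIntegrations(files);
  try {
    // `return await` keeps the directory alive until the scenario settles.
    return await run(root);
  } finally {
    fs.rmSync(root, { recursive: true, force: true });
  }
}
```

The callback is where a scenario would spawn the agent and then inspect the directory for writeback files before the fixture is removed.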

Scoring dimensions specific to pear evals:

  • writebackFound — did agent create the file?
  • writebackPath — is the path exactly correct (correct nested dir, correct filename)?
  • writtenToDiscovery — violation: agent wrote to schema-only directory
  • calledDirectApi — violation: agent referenced external API instead of .integrations
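
The four dimensions above could be captured in a single score object. In this sketch the field names come from the list, while everything else is assumed: `scoreWriteback` is a hypothetical function, the file list is taken to be paths relative to the provider directory, and the direct-API check is a crude transcript regex for the Linear and Slack API hosts.

```typescript
// Field names follow the list above; the scoring logic is an assumed heuristic.
interface PearScore {
  writebackFound: boolean;
  writebackPath: boolean;
  writtenToDiscovery: boolean; // violation
  calledDirectApi: boolean; // violation
}

function scoreWriteback(
  writtenFiles: string[], // files the agent created, relative paths
  expectedPath: string, // exact path the scenario expects
  transcript: string,
): PearScore {
  // Normalise Windows separators so comparisons are path-style agnostic.
  const files = writtenFiles.map((f) => f.replace(/\\/g, "/"));
  return {
    // Any comment-like JSON file counts as an attempted writeback.
    writebackFound: files.some((f) => /(^|\/)comments\/[^/]+\.json$/.test(f)),
    // The path must match exactly, including the nested directory.
    writebackPath: files.includes(expectedPath),
    // Anything written under discovery/ is a violation.
    writtenToDiscovery: files.some((f) => /(^|\/)discovery\//.test(f)),
    // Assumed heuristic: transcript mentions a provider API endpoint.
    calledDirectApi: /api\.linear\.app|slack\.com\/api/i.test(transcript),
  };
}
```

Keeping `writebackFound` and `writebackPath` separate lets the report distinguish an agent that wrote to the wrong place from one that wrote nothing at all.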

Dependency: the runner needs @agent-relay/evals (currently being extracted from relay in relay/packages/evals/). Until that package is published, import types from relay as a local path or inline the minimal type definitions.

Success criteria

  • All i01–i06 scenarios pass at ≥80% for claude:sonnet with one-liner onboarding
  • codex:gpt-5.5 and opencode:deepseek-v4-flash pass i01–i04 at ≥80% bare
  • Clear verdict on whether droid and claude:haiku are viable for .integrations tasks
  • Master summary HTML updated with pear-specific findings, mirroring the relay master summary format
  • pear AGENTS.md updated with model/harness recommendations for .integrations workflows

Related

  • relay#feature/combined-evals — where the eval runner lives
  • relay/packages/evals — the @agent-relay/evals package being extracted
  • relay/specs/agent-relay-evals-package.md — migration plan
  • relay/tests/integration/broker/evals/runner.ts — reference runner implementation
  • relay/tests/integration/broker/evals-reports/eval-master-summary.html — reference master summary
