feat(agent-evals): add suite-based behavioral eval harness for agent onboarding fixes NV-8059 #11589
Merged
Changes from 15 commits
e0a0074 feat(libs): add suite-based behavioral eval harness for agent onboarding (djabarovgeorge)
a1da7bc Merge branch 'next' into feat/agent-evals-harness (djabarovgeorge)
d3d61fd feat(libs): refine agent-evals harness and wire shared doc export fix… (djabarovgeorge)
a20c131 refactor(agent-evals): streamline evaluation harness and enhance grad… (djabarovgeorge)
14e65f5 feat(agent-evals): enhance onboarding flow with dashboard OAuth and U… (djabarovgeorge)
fa6efbd feat(agent-evals): enhance README and grading logic for better failur… (djabarovgeorge)
79cf8ff chore(agent-evals): simplify GitHub Actions workflow for agent evalua… (djabarovgeorge)
2ffc5a1 fix(agent-evals): harden harness guards from PR review fixes NV-8059 (djabarovgeorge)
5be8c98 ci(agent-evals): run eval workflow on harness changes fixes NV-8059 (djabarovgeorge)
fffdf2f fix(agent-evals): accept markdown QR delivery and wire scheduled eval… (djabarovgeorge)
e3b8200 fix(agent-evals): make watcher guard quote/escape aware fixes NV-8059 (djabarovgeorge)
06e09c5 Merge remote-tracking branch 'origin/next' into feat/agent-evals-harness (djabarovgeorge)
704e8ee fix(ci): run agent eval workflows only on PRs to next (djabarovgeorge)
db7d9aa fix(ci): trigger onboarding webhook on merge to next (djabarovgeorge)
85646fa chore(ci): drop unrelated onboarding webhook workflow changes (djabarovgeorge)
40d37e9 feat(agent-evals): always run LLM judge graders (djabarovgeorge)
20e1f5f fix(agent-evals): record each Read tool call once fixes NV-8059 (cursoragent)
9ebc0d5 fix(agent-evals): harden connect parsing, channel/keyless validation,… (cursoragent)
6136d1b fix(agent-evals): avoid duplicating final turn in transcriptText fixe… (cursoragent)
6629cd3 fix(agent-evals): align conclusion-first judge prompt with playbook N… (djabarovgeorge)
9370681 fix(agent-evals): enforce keyless flow in keyless-default scenarios f… (cursoragent)
707fcce Merge remote-tracking branch 'origin/next' into cursor/agent-evals-ha… (cursoragent)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,99 @@ | ||
| --- | ||
| name: triage-agent-eval-failures | ||
| description: Triage failing @novu/agent-evals scenarios to decide whether a failure is real or flaky, and whether to fix the playbook/prompt or the test (grader, tape, scenario, or judge). Use when an agent-evals scenario fails, when the user asks why an eval is red, or when deciding whether to fix the test or the prompt. | ||
| --- | ||
|
|
||
| # Triage Agent Eval Failures | ||
|
|
||
| Diagnose a failing scenario in `libs/agent-evals` and produce a verdict: the failure is **real** (the playbook under test regressed), the **test** is wrong (grader / tape / scenario / judge), or the run is merely **flaky** (model non-determinism). | |
|
|
||
| The thing under test is the playbook doc (`packages/shared/docs/agent-onboarding.md`), injected as the agent system prompt. Everything else (`graders.ts`, `catalog.ts`, `scenario.ts`, judge prompts) is test scaffolding. **Never fix the playbook to satisfy a broken grader, and never loosen a grader to hide a real playbook regression.** | ||
|
|
||
| ## Rule 0: rule out flakiness before changing anything | ||
|
|
||
| Scenarios run a live model concurrently, so one red run is one sample, not a verdict. Re-run the single failing scenario 3–5× first: | ||
|
|
||
| ```bash | ||
| pnpm --filter @novu/agent-evals exec vitest run --config vitest.evals.config.ts -t <scenario-id> | ||
| ``` | ||
|
|
||
| - Fails **every** run → deterministic failure, continue triage. | ||
| - Fails **intermittently** → flaky. The cause is usually a non-deterministic judge grader or an over-strict regex. Do not edit the playbook. Sharpen the judge prompt, correct the over-strict check so it accepts valid variants, or accept variance; consider pass@k rather than single-run gating. | |
|
|
||
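The re-run rule above can be sketched as a small classifier. This is an illustrative helper only: `summarizeReruns` and `passAtK` are hypothetical names that do not exist in `@novu/agent-evals`, and `passAtK` uses the simplest reading of pass@k (at least one of the first k runs is green).

```typescript
// Hypothetical helpers for Rule 0; not part of @novu/agent-evals.
type RerunVerdict = "passing" | "flaky" | "deterministic";

interface RerunSummary {
  failed: number;
  total: number;
  verdict: RerunVerdict;
}

// `passed[i]` is true when re-run i of the single scenario was green.
function summarizeReruns(passed: boolean[]): RerunSummary {
  if (passed.length === 0) {
    throw new Error("need at least one re-run");
  }
  const failed = passed.filter((ok) => !ok).length;
  let verdict: RerunVerdict = "flaky";
  if (failed === 0) {
    verdict = "passing";
  } else if (failed === passed.length) {
    verdict = "deterministic"; // fails every run: continue triage
  }
  return { failed, total: passed.length, verdict };
}

// Simplest form of pass@k: at least one of the first k runs passed.
function passAtK(passed: boolean[], k: number): boolean {
  return passed.slice(0, k).some((ok) => ok);
}
```

A 5/5 failure maps to `deterministic` and a 2/5 failure to `flaky`, which matches the "N/M failed" line in the report template.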
| To reproduce judge graders locally (PR/push CI runs deterministic graders only): | ||
|
|
||
| ```bash | ||
| NOVU_EVAL_JUDGE=true pnpm --filter @novu/agent-evals exec vitest run --config vitest.evals.config.ts -t <scenario-id> | ||
| ``` | ||
|
|
||
| ## Step 1: identify which grader failed and its kind | ||
|
|
||
| Each scenario registers graders in `scenarios/<id>/graders.ts`. The **kind** is the strongest triage signal: | ||
|
|
||
| - **Deterministic** graders (`catalog.*`, `contains`, `matches`) inspect the structured `RunResult`. A fail means the agent's actions/output objectively did not match — or the check is too strict. | ||
| - **Judge** graders (`sharedJudgeGraders`, `judge(...)`) call a second LLM pass. A fail is fuzzy and can be the judge prompt's fault, not the agent's. | ||
|
|
||
| Find the grader's logic: | ||
|
|
||
| | Layer | Location | | ||
| | --- | --- | | ||
| | Per-scenario grader wiring | `src/suites/agent-onboarding/scenarios/<id>/graders.ts` | | ||
| | Deterministic grader bodies | `src/suites/agent-onboarding/catalog.ts` (`catalog` object) | | ||
| | Judge prompts | `catalog.ts` (`judgePrompts`) + `sharedJudgeGraders` | | ||
| | Generic helpers | `src/core/graders.ts` (`contains`, `matches`, `toolCallsNamed`, `transcriptText`) | | ||
| | Judge mechanics | `src/core/judge.ts` (returns `skip` on `UNKNOWN`) | | ||
|
|
||
| ## Step 2: read the RunResult evidence | ||
|
|
||
| Graders read fields off `RunResult` (`src/core/types.ts`). Map the failing grader to the field it checks and compare against what the agent actually did in the run output: | ||
|
|
||
| - `trackedCommands` — raw connect command strings (flag checks like `--keyless`, `--secret-key`, `--slack-config-token`). | ||
| - `toolCalls` — every `Bash` / `BashOutput` / `AskUserQuestion` / `Read` call with args (`run_in_background`, `file_path`, picker `selectedId`). | ||
| - `polledShellIds` / `killedShellIds` — background-polling and kill behavior. | ||
| - `capturedUrls` / `openedFiles` — surfaced URLs and opened files (e.g. QR `.png`, auth-url file). | ||
| - `finalText` / `assistantMessages` — user-facing report (`transcriptText` joins these). | ||
| - `metadata.description` — the drafted agent description (persona / infra-token graders). | ||
|
|
||
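When collecting evidence it can help to pull the relevant slice out of the run before comparing it with the grader. The sketch below assumes a simplified shape for `RunResult`; the real type lives in `src/core/types.ts` and may differ. `RunResultSketch`, `callsByName`, `readPaths`, and `backgroundBashCount` are hypothetical names introduced for illustration.

```typescript
// Assumed, simplified shapes; the real RunResult in src/core/types.ts may differ.
interface ToolCallSketch {
  name: string;
  args: Record<string, unknown>;
}

interface RunResultSketch {
  trackedCommands: string[];
  toolCalls: ToolCallSketch[];
  polledShellIds: string[];
  finalText: string;
}

// All tool calls with the given tool name, in call order.
function callsByName(run: RunResultSketch, name: string): ToolCallSketch[] {
  return run.toolCalls.filter((call) => call.name === name);
}

// File paths the agent opened through the Read tool.
function readPaths(run: RunResultSketch): string[] {
  return callsByName(run, "Read")
    .map((call) => call.args["file_path"])
    .filter((path): path is string => typeof path === "string");
}

// Number of Bash calls started with run_in_background: true.
function backgroundBashCount(run: RunResultSketch): number {
  return callsByName(run, "Bash").filter(
    (call) => call.args["run_in_background"] === true
  ).length;
}
```

Quoting the extracted values (for example the exact `file_path` strings) in the report makes the "actual vs expected" comparison easy to check.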
| ## Step 3: classify the failure | ||
|
|
||
| Walk top-down and stop at the first match: | ||
|
|
||
| | Symptom | Verdict | Fix target | | ||
| | --- | --- | --- | | ||
| | Agent never ran the tracked command / ignored an instruction it should follow | **Real — discovery** | Playbook `agent-onboarding.md` (instruction unclear/missing) | | ||
| | Deterministic grader fails and the `RunResult` confirms the agent genuinely did the wrong thing | **Real — execution** | Playbook `agent-onboarding.md` | | ||
| | Deterministic grader fails but `RunResult` shows the agent behaved correctly (regex too strict, wrong field, valid variant rejected) | **Test bug** | `catalog.ts` grader logic | | ||
| | Fails only on the scripted CLI path; tape stdout/`when`/`validate` or scripted answers are wrong or stale | **Test bug** | `scenario.ts` (`tape`, `scriptedAnswers`), `connect-parser.ts` | | ||
| | Judge grader fails but the description/report actually satisfies the criterion | **Test bug** | Judge prompt in `catalog.ts` (`judgePrompts`) | | ||
| | Judge verdict flips run-to-run | **Flaky judge** | Sharpen judge prompt; rely on `UNKNOWN`→`skip` escape hatch | | ||
| | Passes sometimes, fails sometimes, no clear cause | **Flaky** | Do not edit playbook; re-run (Rule 0) | | ||
|
|
||
| A scenario passes only when every active grader averages ≥ `0.8` (`JUDGE_THRESHOLD`). A judge returning `UNKNOWN` becomes `skip` and scores `1` — it never causes a fail, so an `UNKNOWN` is not evidence of a real regression. | ||
|
|
||
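The pass rule can be read as the following sketch. It rests on two assumptions not confirmed by the text: outcomes are binary (the real harness may produce fractional scores), and "averages" means each grader's mean over its repeated samples. Only the `0.8` threshold and the `skip` scoring come from the description above; every name in the block is hypothetical.

```typescript
// Assumptions: binary outcomes, and per-grader averaging over repeated samples.
type Outcome = "pass" | "fail" | "skip";

interface GraderSamples {
  name: string;
  outcomes: Outcome[];
}

const JUDGE_THRESHOLD = 0.8; // value quoted in the text above

// `skip` (a judge answering UNKNOWN) scores 1, so it never pulls an average down.
function score(outcome: Outcome): number {
  return outcome === "fail" ? 0 : 1;
}

function average(outcomes: Outcome[]): number {
  if (outcomes.length === 0) {
    throw new Error("grader has no samples");
  }
  const total = outcomes.reduce((sum, outcome) => sum + score(outcome), 0);
  return total / outcomes.length;
}

// Names of graders below the threshold; an empty list means the scenario passes.
function failingGraders(graders: GraderSamples[]): string[] {
  return graders
    .filter((grader) => average(grader.outcomes) < JUDGE_THRESHOLD)
    .map((grader) => grader.name);
}
```

Under these assumptions a grader that fails one sample in five still clears the threshold, while a judge that only ever returns `UNKNOWN` averages `1` and cannot turn the scenario red.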
| ## Step 4: apply one bounded fix, then verify | ||
|
|
||
| 1. Change **only** the layer the verdict points to — playbook **or** test, never both to chase green. | ||
| 2. Re-run the single scenario (Rule 0 command), with `NOVU_EVAL_JUDGE=true` if a judge grader was involved. | |
| 3. Confirm the fix holds across the 3–5 re-runs and that no other scenario regressed. | ||
| 4. If editing a deterministic grader, also run the synthetic unit tests so you don't break grader contracts: | ||
|
|
||
| ```bash | ||
| pnpm --filter @novu/agent-evals test | ||
| ``` | ||
|
|
||
| ## Output format | ||
|
|
||
| Report the verdict concisely with cited evidence: | ||
|
|
||
| ``` | ||
| Scenario: <id> | ||
| Failing grader: <name> (deterministic | judge) | ||
| Re-run result: <N/M failed> → real | flaky | ||
| Evidence: <RunResult field + actual vs expected> | ||
| Verdict: real playbook regression | test bug (<grader|tape|scenario|judge>) | flaky | ||
| Fix target: <file path> (or: no change — flaky/UNKNOWN) | ||
| ``` | ||
|
|
||
| ## Additional resources | ||
|
|
||
| For worked triage examples (real regression vs test bug vs flaky judge), see [reference.md](reference.md). |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,128 @@ | ||
| # Triage examples | ||
|
|
||
| Worked examples for the `triage-agent-eval-failures` skill. Each walks through evidence → verdict → fix target. | ||
|
|
||
| ## Example 1: Real playbook regression — `usedDashboardOAuthWhenPrompted` | ||
|
|
||
| **Scenario:** `dashboard-prompt-login` | ||
| **Failing grader:** `usedDashboardOAuthWhenPrompted` (deterministic) | ||
| **Re-run result:** 5/5 failed → real | ||
|
|
||
| **Evidence:** | ||
|
|
||
| ``` | ||
| userPrompt: "I'm signed in to the Novu dashboard..." | ||
| trackedCommands: ["npx novu connect --keyless --channel slack"] | ||
| ``` | ||
|
|
||
| The grader in `catalog.ts` checks: when `userPrompt` mentions "signed in to the Novu dashboard", every `trackedCommands` entry must omit `--keyless`. The agent ran connect with `--keyless` anyway. | ||
|
|
||
| **Verdict:** Real — execution. The playbook did not steer the agent toward dashboard OAuth when the user says they are signed in. | ||
|
|
||
| **Fix target:** `packages/shared/docs/agent-onboarding.md` — clarify that dashboard-signed-in users must omit `--keyless`. | ||
|
|
||
| **Do not:** Loosen the grader to accept `--keyless` when the prompt mentions the dashboard. | ||
|
|
||
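For orientation, the check described in Example 1 could look roughly like the sketch below. It is not the code in `catalog.ts`: the function name, the input and result shapes, and the whitespace-based flag parsing are all assumptions, and the naive tokenizer ignores shell quoting.

```typescript
// Hypothetical stand-in for the grader described above; not the catalog.ts implementation.
interface GraderInputSketch {
  userPrompt: string;
  trackedCommands: string[];
}

interface GraderResultSketch {
  pass: boolean;
  reason: string;
}

// Token-aware match so `--keyless` does not match `--keyless-foo`.
// Naive: splits on whitespace and ignores shell quoting.
function hasFlag(command: string, flag: string): boolean {
  return command
    .split(/\s+/)
    .some((token) => token === flag || token.startsWith(`${flag}=`));
}

function usedDashboardOAuthWhenPromptedSketch(
  input: GraderInputSketch
): GraderResultSketch {
  const prompted = /signed in to the Novu dashboard/i.test(input.userPrompt);
  if (!prompted) {
    return { pass: true, reason: "prompt does not mention the dashboard" };
  }
  // "Every entry must omit --keyless" is vacuously true for an empty list;
  // a real grader might also require at least one tracked command.
  const offenders = input.trackedCommands.filter((c) => hasFlag(c, "--keyless"));
  return offenders.length === 0
    ? { pass: true, reason: "no tracked command used --keyless" }
    : { pass: false, reason: `--keyless used in: ${offenders.join("; ")}` };
}
```

Run against the evidence in Example 1, this sketch fails and names the offending command, which is the kind of "actual vs expected" detail the report should cite.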
| --- | ||
|
|
||
| ## Example 2: Test bug — `readAuthUrlFile` with correct behavior | ||
|
|
||
| **Scenario:** `dashboard-prompt-login` | ||
| **Failing grader:** `readAuthUrlFile` (deterministic) | ||
| **Re-run result:** 5/5 failed → deterministic (but the test is wrong) | |
|
|
||
| **Evidence:** | ||
|
|
||
| ``` | ||
| toolCalls: [ | ||
| { name: "Read", args: { file_path: "/project/novu-connect-auth-url.txt" } } | ||
| ] | ||
| capturedUrls: ["https://auth.novu.test/oauth/device?code=abc"] | ||
| transcriptText: "Open https://auth.novu.test/oauth/device?code=abc to authorize" | ||
| ``` | ||
|
|
||
| The grader checks for `novu-connect-auth-url` in the Read path, `/oauth/device` in `capturedUrls`, or `/oauth/device` in the transcript. All three are satisfied. | ||
|
|
||
| **Verdict:** Test bug — grader. The failure reason may reference a path variant the check does not cover (e.g. relative vs absolute path in `file_path`). Inspect `catalog.readAuthUrlFile` for an overly narrow `includes('novu-connect-auth-url')` match. | ||
|
|
||
| **Fix target:** `src/suites/agent-onboarding/catalog.ts` — widen the Read path check or normalize paths before comparing. | ||
|
|
||
| **Do not:** Change the playbook; the agent already surfaced the auth URL correctly. | ||
|
|
||
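One way to "normalize paths before comparing" is to match on the file's base name instead of the raw `file_path` string. The sketch below is an assumption about what such a fix might look like; `baseName` and `isAuthUrlFile` are hypothetical helpers, and the actual logic in `catalog.readAuthUrlFile` is not shown in this PR excerpt.

```typescript
// Hypothetical helpers; the real check lives in catalog.readAuthUrlFile.

// Final path segment, after converting backslashes and trimming trailing slashes.
function baseName(filePath: string): string {
  const normalized = filePath.replace(/\\/g, "/").replace(/\/+$/, "");
  const lastSlash = normalized.lastIndexOf("/");
  return lastSlash === -1 ? normalized : normalized.slice(lastSlash + 1);
}

// True when the opened file itself (not a parent directory) is the auth-url file.
function isAuthUrlFile(filePath: string): boolean {
  return baseName(filePath).toLowerCase().startsWith("novu-connect-auth-url");
}
```

This treats absolute, relative, and Windows-style paths alike. It is also stricter than a plain substring match in one respect: a directory named `novu-connect-auth-url` no longer counts.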
| --- | ||
|
|
||
| ## Example 3: Flaky judge — `conclusionFirstReport` | ||
|
|
||
| **Scenario:** `dashboard-prompt-login` | ||
| **Failing grader:** `conclusionFirstReport` (judge) | ||
| **Re-run result:** 2/5 failed → flaky | ||
|
|
||
| **Evidence (passing run):** | ||
|
|
||
| ``` | ||
| finalText: "✓ Your agent is live. Open the dashboard to manage it: https://dashboard.novu.test/agents/dash-agent-1" | ||
| ``` | ||
|
|
||
| **Evidence (failing run, same agent output):** | ||
|
|
||
| ``` | ||
| finalText: "✓ Your agent is live. Open the dashboard to manage it: https://dashboard.novu.test/agents/dash-agent-1" | ||
| judge rationale: "The message leads with a success statement but then adds setup context before the next action." | ||
| ``` | ||
|
|
||
| The deterministic graders all pass. The judge prompt asks whether the first line states the CLI result followed by the single next action. The agent output is identical; only the judge verdict flips. | ||
|
|
||
| **Verdict:** Flaky judge. Non-deterministic LLM grading on a borderline structure. | ||
|
|
||
| **Fix target:** Either sharpen `judgePrompts.conclusionFirstReport` in `catalog.ts` with explicit pass/fail examples, or accept variance and track pass@k. Do not edit the playbook for a 2/5 flake. | ||
|
|
||
| **Note:** A judge returning `UNKNOWN` scores as `skip` (pass). An `UNKNOWN` is not a regression signal. | ||
|
|
||
| --- | ||
|
|
||
| ## Example 4: Test bug — stale tape chunk | ||
|
|
||
| **Scenario:** `dashboard-prompt-login` | ||
| **Failing grader:** `reportedSuccess` (deterministic) | ||
| **Re-run result:** 5/5 failed → deterministic (but the tape is wrong) | |
|
|
||
| **Evidence:** | ||
|
|
||
| ``` | ||
| trackedCommands: ["npx novu connect --channel slack"] // correct | ||
| polledShellIds: ["shell-1"] // correct | ||
| transcriptText: "Waiting for connect to finish..." // agent never saw success stdout | ||
| ``` | ||
|
|
||
| The agent polled the background shell but the final transcript never contains "agent is live". The tape in `scenario.ts` emits success stdout in the last chunk, but `connectTape` validation rejected the command before replay (e.g. `requireNoKeyless: true` but parser flags differ). | ||
|
|
||
| **Verdict:** Test bug — tape/scenario. The fixture did not replay the expected CLI output; the agent behaved correctly given what it received. | ||
|
|
||
| **Fix target:** `scenarios/dashboard-prompt-login/scenario.ts` — fix `tape` chunks or `connectTape` validation flags. Check `connect-parser.ts` if parsed flags do not match tape `when` conditions. | ||
|
|
||
| **Do not:** Change the playbook to tell the agent to report success when the CLI gave no success signal. | ||
|
|
||
| --- | ||
|
|
||
| ## Example 5: Real playbook regression — `confirmedBeforeRun` | ||
|
|
||
| **Scenario:** `persona-infra-exclusion` | ||
| **Failing grader:** `confirmedBeforeRun` (deterministic) | ||
| **Re-run result:** 5/5 failed → real | ||
|
|
||
| **Evidence:** | ||
|
|
||
| ``` | ||
| toolCalls: [ | ||
| { name: "Bash", args: { command: "npx novu connect ..." } }, // index 0 | ||
| { name: "AskUserQuestion", result: { selectedId: "approve" } } // index 2 | ||
| ] | ||
| ``` | ||
|
|
||
| The grader requires an `AskUserQuestion` with `selectedId: "approve"` **before** the first connect `Bash` call. Connect ran first. | ||
|
|
||
| **Verdict:** Real — execution. The playbook does not enforce (or the agent ignored) the confirm-before-run step. | ||
|
|
||
| **Fix target:** `packages/shared/docs/agent-onboarding.md` — strengthen the approval picker requirement before running connect. | ||
|
|
||
| **Do not:** Remove or weaken `catalog.confirmedBeforeRun`. |
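
The ordering rule in Example 5 can be sketched as follows. The tool-call shape is an assumption modelled on the evidence snippet, the `novu connect` regex is a guess at how a connect call might be recognized, and `confirmedBeforeRunSketch` is a hypothetical name, not `catalog.confirmedBeforeRun` itself.

```typescript
// Assumed shape, modelled on the evidence snippet in Example 5.
interface OrderedToolCallSketch {
  name: string;
  args?: { command?: string };
  result?: { selectedId?: string };
}

// Assumption: a connect call is a Bash call whose command contains "novu connect".
function isConnectCall(call: OrderedToolCallSketch): boolean {
  const command = call.args?.command;
  return (
    call.name === "Bash" &&
    typeof command === "string" &&
    /\bnovu\s+connect\b/.test(command)
  );
}

function isApproval(call: OrderedToolCallSketch): boolean {
  return call.name === "AskUserQuestion" && call.result?.selectedId === "approve";
}

// True when an approval appears before the first connect call.
function confirmedBeforeRunSketch(toolCalls: OrderedToolCallSketch[]): boolean {
  const firstConnect = toolCalls.findIndex(isConnectCall);
  if (firstConnect === -1) {
    return true; // assumption: no connect call means nothing needed approval
  }
  return toolCalls.slice(0, firstConnect).some(isApproval);
}
```

With the evidence above (connect first, approval later) the sketch returns `false`; swapping the two calls makes it return `true`.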
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| name: Agent evals | ||
|
|
||
| on: | ||
| pull_request: | ||
| branches: | ||
| - next | ||
| paths: | ||
| - packages/shared/docs/agent-onboarding.md | ||
| - libs/agent-evals/** | ||
| - .github/workflows/agent-evals.yml | ||
|
|
||
|
|
||
| jobs: | ||
| evals: | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 45 | ||
| steps: | ||
| - name: Checkout | ||
| uses: actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd # v5 | ||
|
|
||
| - name: Setup pnpm | ||
| uses: pnpm/action-setup@0e279bb959325dab635dd2c09392533439d90093 # v6.0.8 | ||
| with: | ||
| version: 11.0.9 | ||
|
|
||
| - name: Setup Node.js | ||
| uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4 | ||
| with: | ||
| node-version: 22 | ||
| cache: pnpm | ||
|
|
||
| - name: Install dependencies | ||
| run: pnpm install --frozen-lockfile | ||
|
|
||
| - name: Run agent evals | ||
| env: | ||
| ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} | ||
| NOVU_EVAL_JUDGE: 'false' | ||
| run: pnpm --filter @novu/agent-evals eval src/suites/agent-onboarding | ||