Merged
22 commits
e0a0074
feat(libs): add suite-based behavioral eval harness for agent onboarding
djabarovgeorge Jun 16, 2026
a1da7bc
Merge branch 'next' into feat/agent-evals-harness
djabarovgeorge Jun 16, 2026
d3d61fd
feat(libs): refine agent-evals harness and wire shared doc export fix…
djabarovgeorge Jun 17, 2026
a20c131
refactor(agent-evals): streamline evaluation harness and enhance grad…
djabarovgeorge Jun 17, 2026
14e65f5
feat(agent-evals): enhance onboarding flow with dashboard OAuth and U…
djabarovgeorge Jun 17, 2026
fa6efbd
feat(agent-evals): enhance README and grading logic for better failur…
djabarovgeorge Jun 18, 2026
79cf8ff
chore(agent-evals): simplify GitHub Actions workflow for agent evalua…
djabarovgeorge Jun 18, 2026
2ffc5a1
fix(agent-evals): harden harness guards from PR review fixes NV-8059
djabarovgeorge Jun 18, 2026
5be8c98
ci(agent-evals): run eval workflow on harness changes fixes NV-8059
djabarovgeorge Jun 18, 2026
fffdf2f
fix(agent-evals): accept markdown QR delivery and wire scheduled eval…
djabarovgeorge Jun 18, 2026
e3b8200
fix(agent-evals): make watcher guard quote/escape aware fixes NV-8059
djabarovgeorge Jun 18, 2026
06e09c5
Merge remote-tracking branch 'origin/next' into feat/agent-evals-harness
djabarovgeorge Jun 18, 2026
704e8ee
fix(ci): run agent eval workflows only on PRs to next
djabarovgeorge Jun 21, 2026
db7d9aa
fix(ci): trigger onboarding webhook on merge to next
djabarovgeorge Jun 21, 2026
85646fa
chore(ci): drop unrelated onboarding webhook workflow changes
djabarovgeorge Jun 21, 2026
40d37e9
feat(agent-evals): always run LLM judge graders
djabarovgeorge Jun 21, 2026
20e1f5f
fix(agent-evals): record each Read tool call once fixes NV-8059
cursoragent Jun 21, 2026
9ebc0d5
fix(agent-evals): harden connect parsing, channel/keyless validation,…
cursoragent Jun 21, 2026
6136d1b
fix(agent-evals): avoid duplicating final turn in transcriptText fixe…
cursoragent Jun 21, 2026
6629cd3
fix(agent-evals): align conclusion-first judge prompt with playbook N…
djabarovgeorge Jun 22, 2026
9370681
fix(agent-evals): enforce keyless flow in keyless-default scenarios f…
cursoragent Jun 22, 2026
707fcce
Merge remote-tracking branch 'origin/next' into cursor/agent-evals-ha…
cursoragent Jun 22, 2026
99 changes: 99 additions & 0 deletions .cursor/skills/triage-agent-eval-failures/SKILL.md
@@ -0,0 +1,99 @@
---
name: triage-agent-eval-failures
description: Triage failing @novu/agent-evals scenarios to decide whether a failure is real or flaky, and whether to fix the playbook/prompt or the test (grader, tape, scenario, or judge). Use when an agent-evals scenario fails, when the user asks why an eval is red, or when deciding whether to fix the test or the prompt.
---

# Triage Agent Eval Failures

Diagnose a failing scenario in `libs/agent-evals` and produce a verdict: the failure is **real** (the playbook under test regressed), the **test** is wrong (grader / tape / scenario / judge), or the run is just **flaky** (model non-determinism).

The thing under test is the playbook doc (`packages/shared/docs/agent-onboarding.md`), injected as the agent system prompt. Everything else (`graders.ts`, `catalog.ts`, `scenario.ts`, judge prompts) is test scaffolding. **Never fix the playbook to satisfy a broken grader, and never loosen a grader to hide a real playbook regression.**

## Rule 0: rule out flakiness before changing anything

Scenarios run a live model concurrently, so one red run is one sample, not a verdict. Re-run the single failing scenario 3–5× first:

```bash
pnpm --filter @novu/agent-evals exec vitest run --config vitest.evals.config.ts -t <scenario-id>
```

- Fails **every** run → deterministic failure, continue triage.
- Fails **intermittently** → flaky. The cause is usually a non-deterministic judge grader or an over-strict regex. Do not edit the playbook. Sharpen the judge prompt, relax the over-strict regex, or accept variance; consider pass@k rather than single-run gating.
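
The two outcomes above can be sketched as a small helper that turns a list of re-run results into a label. This is only an illustration: `classifyReruns` and `RerunVerdict` are hypothetical names, not part of `@novu/agent-evals`, and the helper assumes you record one pass/fail boolean per re-run.

```typescript
// Hypothetical helper for summarizing re-runs; not part of @novu/agent-evals.
type RerunVerdict = "passing" | "flaky" | "deterministic";

// `outcomes` holds one entry per re-run: true = scenario passed, false = failed.
function classifyReruns(outcomes: boolean[]): RerunVerdict {
  if (outcomes.length === 0) {
    throw new Error("classifyReruns needs at least one re-run outcome");
  }
  const failures = outcomes.filter((passed) => !passed).length;
  if (failures === 0) return "passing";
  // Every run failed: treat as deterministic and continue triage.
  // Some runs failed: treat as flaky and leave the playbook alone.
  return failures === outcomes.length ? "deterministic" : "flaky";
}

console.log(classifyReruns([false, false, false, false, false])); // "deterministic"
console.log(classifyReruns([true, false, true, true, false])); // "flaky"
```

A "deterministic" label only says the failure reproduces; it does not yet say whether the playbook or the test is at fault.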

To reproduce judge graders locally (the PR workflow sets `NOVU_EVAL_JUDGE: 'false'`, so CI runs deterministic graders only):

```bash
NOVU_EVAL_JUDGE=true pnpm --filter @novu/agent-evals exec vitest run --config vitest.evals.config.ts -t <scenario-id>
```

## Step 1: identify which grader failed and its kind

Each scenario registers graders in `scenarios/<id>/graders.ts`. The **kind** is the strongest triage signal:

- **Deterministic** graders (`catalog.*`, `contains`, `matches`) inspect the structured `RunResult`. A fail means the agent's actions/output objectively did not match — or the check is too strict.
- **Judge** graders (`sharedJudgeGraders`, `judge(...)`) call a second LLM pass. A fail is fuzzy and can be the judge prompt's fault, not the agent's.

Find the grader's logic:

| Layer | Location |
| --- | --- |
| Per-scenario grader wiring | `src/suites/agent-onboarding/scenarios/<id>/graders.ts` |
| Deterministic grader bodies | `src/suites/agent-onboarding/catalog.ts` (`catalog` object) |
| Judge prompts | `catalog.ts` (`judgePrompts`) + `sharedJudgeGraders` |
| Generic helpers | `src/core/graders.ts` (`contains`, `matches`, `toolCallsNamed`, `transcriptText`) |
| Judge mechanics | `src/core/judge.ts` (returns `skip` on `UNKNOWN`) |

## Step 2: read the RunResult evidence

Graders read fields off `RunResult` (`src/core/types.ts`). Map the failing grader to the field it checks and compare against what the agent actually did in the run output:

- `trackedCommands` — raw connect command strings (flag checks like `--keyless`, `--secret-key`, `--slack-config-token`).
- `toolCalls` — every `Bash` / `BashOutput` / `AskUserQuestion` / `Read` call with args (`run_in_background`, `file_path`, picker `selectedId`).
- `polledShellIds` / `killedShellIds` — background-polling and kill behavior.
- `capturedUrls` / `openedFiles` — surfaced URLs and opened files (e.g. QR `.png`, auth-url file).
- `finalText` / `assistantMessages` — user-facing report (`transcriptText` joins these).
- `metadata.description` — the drafted agent description (persona / infra-token graders).
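
As a rough illustration of mapping a grader to the field it reads, the sketch below models a few of the fields listed above and runs two simple checks against them. The field names come from this list; the TypeScript shapes, `RunResultSketch`, `commandHasFlag`, `anyTrackedCommandHasFlag`, and `toolCallsByName` are assumptions made for the example and may not match `src/core/types.ts` or `src/core/graders.ts`.

```typescript
// Assumed shapes: field names come from the list above, the types are guesses.
interface ToolCallSketch {
  name: string;
  args: Record<string, unknown>;
}

interface RunResultSketch {
  trackedCommands: string[];
  toolCalls: ToolCallSketch[];
  finalText: string;
}

// Naive whitespace tokenizer; it does not handle quoted or escaped arguments.
function commandHasFlag(command: string, flag: string): boolean {
  return command.trim().split(/\s+/).includes(flag);
}

// Hypothetical deterministic check: did any tracked connect command pass `flag`?
function anyTrackedCommandHasFlag(run: RunResultSketch, flag: string): boolean {
  return run.trackedCommands.some((command) => commandHasFlag(command, flag));
}

// Hypothetical lookup, e.g. to inspect `run_in_background` or `file_path` args.
function toolCallsByName(run: RunResultSketch, name: string): ToolCallSketch[] {
  return run.toolCalls.filter((call) => call.name === name);
}

const sampleRun: RunResultSketch = {
  trackedCommands: ["npx novu connect --keyless --channel slack"],
  toolCalls: [
    { name: "Bash", args: { command: "npx novu connect --keyless --channel slack", run_in_background: true } },
    { name: "Read", args: { file_path: "/project/novu-connect-auth-url.txt" } },
  ],
  finalText: "Your agent is live.",
};

console.log(anyTrackedCommandHasFlag(sampleRun, "--keyless")); // true
console.log(toolCallsByName(sampleRun, "Read").length); // 1
```

Comparing a dump like `sampleRun` with what the failing grader expects is usually enough to tell whether the agent or the check is wrong.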

## Step 3: classify the failure

Walk top-down and stop at the first match:

| Symptom | Verdict | Fix target |
| --- | --- | --- |
| Agent never ran the tracked command / ignored an instruction it should follow | **Real — discovery** | Playbook `agent-onboarding.md` (instruction unclear/missing) |
| Deterministic grader fails and the `RunResult` confirms the agent genuinely did the wrong thing | **Real — execution** | Playbook `agent-onboarding.md` |
| Deterministic grader fails but `RunResult` shows the agent behaved correctly (regex too strict, wrong field, valid variant rejected) | **Test bug** | `catalog.ts` grader logic |
| Fails only on the scripted CLI path; tape stdout/`when`/`validate` or scripted answers are wrong or stale | **Test bug** | `scenario.ts` (`tape`, `scriptedAnswers`), `connect-parser.ts` |
| Judge grader fails but the description/report actually satisfies the criterion | **Test bug** | Judge prompt in `catalog.ts` (`judgePrompts`) |
| Judge verdict flips run-to-run | **Flaky judge** | Sharpen judge prompt; rely on `UNKNOWN`→`skip` escape hatch |
| Passes sometimes, fails sometimes, no clear cause | **Flaky** | Do not edit playbook; re-run (Rule 0) |

A scenario passes only when every active grader averages ≥ `0.8` (`JUDGE_THRESHOLD`). A judge returning `UNKNOWN` becomes `skip` and scores `1` — it never causes a fail, so an `UNKNOWN` is not evidence of a real regression.
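
The pass rule can be sketched as follows. This is a guess at the arithmetic, not the harness code: it assumes `pass` scores `1`, `fail` scores `0`, `skip` scores `1`, and that "averages" means the mean over that grader's recorded outcomes. `scoreOutcome`, `graderAverage`, and `scenarioPasses` are hypothetical names; only `JUDGE_THRESHOLD` and the `0.8` value come from the text above.

```typescript
// Hypothetical scoring sketch; outcome names and averaging scope are assumptions.
type GraderOutcome = "pass" | "fail" | "skip";

const JUDGE_THRESHOLD = 0.8;

// A skip (for example a judge answering UNKNOWN) counts as 1, so it cannot cause a fail.
function scoreOutcome(outcome: GraderOutcome): number {
  return outcome === "fail" ? 0 : 1;
}

function graderAverage(outcomes: GraderOutcome[]): number {
  if (outcomes.length === 0) {
    throw new Error("graderAverage needs at least one outcome");
  }
  const total = outcomes.reduce((sum, outcome) => sum + scoreOutcome(outcome), 0);
  return total / outcomes.length;
}

// The scenario passes only if every grader's average reaches the threshold.
function scenarioPasses(outcomesByGrader: Record<string, GraderOutcome[]>): boolean {
  return Object.values(outcomesByGrader).every(
    (outcomes) => graderAverage(outcomes) >= JUDGE_THRESHOLD
  );
}

console.log(scenarioPasses({ reportedSuccess: ["pass"], conclusionFirstReport: ["skip"] })); // true
console.log(scenarioPasses({ reportedSuccess: ["pass"], conclusionFirstReport: ["fail"] })); // false
```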

## Step 4: apply one bounded fix, then verify

1. Change **only** the layer the verdict points to — playbook **or** test, never both to chase green.
2. Re-run the single scenario (Rule 0 command), with `NOVU_EVAL_JUDGE=true` if a judge grader was involved.
3. Confirm the fix holds across the 3–5 re-runs and that no other scenario regressed.
4. If editing a deterministic grader, also run the synthetic unit tests so you don't break grader contracts:

```bash
pnpm --filter @novu/agent-evals test
```

## Output format

Report the verdict concisely with cited evidence:

```
Scenario: <id>
Failing grader: <name> (deterministic | judge)
Re-run result: <N/M failed> → real | flaky
Evidence: <RunResult field + actual vs expected>
Verdict: real playbook regression | test bug (<grader|tape|scenario|judge>) | flaky
Fix target: <file path> (or: no change — flaky/UNKNOWN)
```

## Additional resources

For worked triage examples (real regression vs test bug vs flaky judge), see [reference.md](reference.md).
128 changes: 128 additions & 0 deletions .cursor/skills/triage-agent-eval-failures/reference.md
@@ -0,0 +1,128 @@
# Triage examples

Worked examples for the `triage-agent-eval-failures` skill. Each walks through evidence → verdict → fix target.

## Example 1: Real playbook regression — `usedDashboardOAuthWhenPrompted`

**Scenario:** `dashboard-prompt-login`
**Failing grader:** `usedDashboardOAuthWhenPrompted` (deterministic)
**Re-run result:** 5/5 failed → real

**Evidence:**

```
userPrompt: "I'm signed in to the Novu dashboard..."
trackedCommands: ["npx novu connect --keyless --channel slack"]
```

The grader in `catalog.ts` checks: when `userPrompt` mentions "signed in to the Novu dashboard", every `trackedCommands` entry must omit `--keyless`. The agent ran connect with `--keyless` anyway.
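
The rule described above could look roughly like the sketch below. It is not the code in `catalog.ts`: the function signature, the `"pass" | "fail" | "skip"` return values, the prompt regex, and the whitespace tokenizer are all assumptions made for illustration.

```typescript
// Hypothetical re-implementation of the described rule; not the catalog.ts code.
type CheckOutcome = "pass" | "fail" | "skip";

function usedDashboardOAuthWhenPromptedSketch(
  userPrompt: string,
  trackedCommands: string[]
): CheckOutcome {
  // The rule only applies when the user says they are signed in to the dashboard.
  if (!/signed in to the Novu dashboard/i.test(userPrompt)) {
    return "skip";
  }
  // Every tracked command must omit --keyless (naive whitespace tokenizing).
  const everyCommandOmitsKeyless = trackedCommands.every(
    (command) => !command.trim().split(/\s+/).includes("--keyless")
  );
  return everyCommandOmitsKeyless ? "pass" : "fail";
}

console.log(
  usedDashboardOAuthWhenPromptedSketch("I'm signed in to the Novu dashboard...", [
    "npx novu connect --keyless --channel slack",
  ])
); // "fail"
```

With the evidence from this example the sketch fails for the same reason the real grader does, which supports the "real" verdict.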

**Verdict:** Real — execution. The playbook did not steer the agent toward dashboard OAuth when the user says they are signed in.

**Fix target:** `packages/shared/docs/agent-onboarding.md` — clarify that dashboard-signed-in users must omit `--keyless`.

**Do not:** Loosen the grader to accept `--keyless` when the prompt mentions the dashboard.

---

## Example 2: Test bug — `readAuthUrlFile` with correct behavior

**Scenario:** `dashboard-prompt-login`
**Failing grader:** `readAuthUrlFile` (deterministic)
**Re-run result:** 5/5 failed → deterministic (but the test is wrong)

**Evidence:**

```
toolCalls: [
  { name: "Read", args: { file_path: "/project/novu-connect-auth-url.txt" } }
]
capturedUrls: ["https://auth.novu.test/oauth/device?code=abc"]
transcriptText: "Open https://auth.novu.test/oauth/device?code=abc to authorize"
```

The grader checks for `novu-connect-auth-url` in the Read path, `/oauth/device` in `capturedUrls`, or `/oauth/device` in the transcript. All three are satisfied.

**Verdict:** Test bug — grader. The failure reason may reference a path variant the check does not cover (e.g. relative vs absolute path in `file_path`). Inspect `catalog.readAuthUrlFile` for an overly narrow `includes('novu-connect-auth-url')` match.

**Fix target:** `src/suites/agent-onboarding/catalog.ts` — widen the Read path check or normalize paths before comparing.
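
One way to normalize paths before comparing is sketched below. `isAuthUrlFileRead` is a hypothetical helper, not the existing `catalog.readAuthUrlFile`; it assumes the file is recognized by a base name starting with `novu-connect-auth-url`, regardless of directory, separator style, or letter case.

```typescript
// Hypothetical path check; the real catalog.readAuthUrlFile may differ.
function isAuthUrlFileRead(filePath: string): boolean {
  // Treat Windows and POSIX separators the same way.
  const normalized = filePath.trim().replace(/\\/g, "/");
  // Compare only the final path segment so relative and absolute paths both match.
  const segments = normalized.split("/").filter((segment) => segment.length > 0);
  const baseName = segments.length > 0 ? segments[segments.length - 1] : "";
  return baseName.toLowerCase().startsWith("novu-connect-auth-url");
}

console.log(isAuthUrlFileRead("/project/novu-connect-auth-url.txt")); // true
console.log(isAuthUrlFileRead("./novu-connect-auth-url.txt")); // true
console.log(isAuthUrlFileRead("/project/README.md")); // false
```

Matching on the base name keeps the check strict about which file was read while ignoring how the agent spelled the path.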

**Do not:** Change the playbook; the agent already surfaced the auth URL correctly.

---

## Example 3: Flaky judge — `conclusionFirstReport`

**Scenario:** `dashboard-prompt-login`
**Failing grader:** `conclusionFirstReport` (judge)
**Re-run result:** 2/5 failed → flaky

**Evidence (passing run):**

```
finalText: "✓ Your agent is live. Open the dashboard to manage it: https://dashboard.novu.test/agents/dash-agent-1"
```

**Evidence (failing run, same agent output):**

```
finalText: "✓ Your agent is live. Open the dashboard to manage it: https://dashboard.novu.test/agents/dash-agent-1"
judge rationale: "The message leads with a success statement but then adds setup context before the next action."
```

The deterministic graders all pass. The judge prompt asks whether the first line states the CLI result followed by the single next action. The agent output is identical; only the judge verdict flips.

**Verdict:** Flaky judge. Non-deterministic LLM grading on a borderline structure.

**Fix target:** Either sharpen `judgePrompts.conclusionFirstReport` in `catalog.ts` with explicit pass/fail examples, or accept variance and track pass@k. Do not edit the playbook for a 2/5 flake.
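
If you choose to track pass@k, the commonly used unbiased estimator is 1 − C(n−c, k) / C(n, k) for `n` runs with `c` passes. The sketch below computes it in a numerically stable product form; `passAtK` is a hypothetical helper and nothing here implies the harness already reports this metric.

```typescript
// Hypothetical helper: unbiased pass@k estimate from n runs with c passes.
// pass@k = 1 - C(n - c, k) / C(n, k), computed as a running product.
function passAtK(totalRuns: number, passedRuns: number, k: number): number {
  const inputsAreIntegers = [totalRuns, passedRuns, k].every((value) => Number.isInteger(value));
  if (!inputsAreIntegers || k < 1 || k > totalRuns || passedRuns < 0 || passedRuns > totalRuns) {
    throw new RangeError("require integers with 0 <= passedRuns <= totalRuns and 1 <= k <= totalRuns");
  }
  const failedRuns = totalRuns - passedRuns;
  // Fewer than k failures: any sample of k runs must contain at least one pass.
  if (failedRuns < k) return 1;
  let allSampledRunsFail = 1;
  for (let i = failedRuns + 1; i <= totalRuns; i++) {
    allSampledRunsFail *= 1 - k / i;
  }
  return 1 - allSampledRunsFail;
}

// Example 3 above: 5 re-runs, 2 failed, so 3 passed.
console.log(passAtK(5, 3, 1).toFixed(2)); // "0.60"
console.log(passAtK(5, 3, 2).toFixed(2)); // "0.90"
```

For the 2/5 flake in this example, gating on pass@2 instead of a single run raises the estimated pass probability from 0.60 to 0.90.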

**Note:** A judge returning `UNKNOWN` scores as `skip` (pass). An `UNKNOWN` is not a regression signal.

---

## Example 4: Test bug — stale tape chunk

**Scenario:** `dashboard-prompt-login`
**Failing grader:** `reportedSuccess` (deterministic)
**Re-run result:** 5/5 failed → deterministic (but the tape is wrong)

**Evidence:**

```
trackedCommands: ["npx novu connect --channel slack"] // correct
polledShellIds: ["shell-1"] // correct
transcriptText: "Waiting for connect to finish..." // agent never saw success stdout
```

The agent polled the background shell but the final transcript never contains "agent is live". The tape in `scenario.ts` emits success stdout in the last chunk, but `connectTape` validation rejected the command before replay (e.g. `requireNoKeyless: true` but parser flags differ).

**Verdict:** Test bug — tape/scenario. The fixture did not replay the expected CLI output; the agent behaved correctly given what it received.

**Fix target:** `scenarios/dashboard-prompt-login/scenario.ts` — fix `tape` chunks or `connectTape` validation flags. Check `connect-parser.ts` if parsed flags do not match tape `when` conditions.

**Do not:** Change the playbook to tell the agent to report success when the CLI gave no success signal.

---

## Example 5: Real playbook regression — `confirmedBeforeRun`

**Scenario:** `persona-infra-exclusion`
**Failing grader:** `confirmedBeforeRun` (deterministic)
**Re-run result:** 5/5 failed → real

**Evidence:**

```
toolCalls: [
  { name: "Bash", args: { command: "npx novu connect ..." } },      // index 0
  { name: "AskUserQuestion", result: { selectedId: "approve" } }    // index 1
]
```

The grader requires an `AskUserQuestion` with `selectedId: "approve"` **before** the first connect `Bash` call. Connect ran first.
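
The ordering requirement can be sketched as below. This is an illustration, not `catalog.confirmedBeforeRun`: the `OrderedToolCall` shape mirrors the evidence snippet, the regex used to spot a connect command is an assumption, and treating "no connect call at all" as a failure is a choice made only for this example.

```typescript
// Hypothetical shapes mirroring the evidence above; the real types may differ.
interface OrderedToolCall {
  name: string;
  args?: { command?: string };
  result?: { selectedId?: string };
}

function confirmedBeforeRunSketch(toolCalls: OrderedToolCall[]): boolean {
  // Find the first Bash call that looks like a connect command.
  const firstConnectIndex = toolCalls.findIndex(
    (call) => call.name === "Bash" && /\bnovu connect\b/.test(call.args?.command ?? "")
  );
  // Assumption for this sketch: with no connect call there is nothing to confirm, so fail.
  if (firstConnectIndex === -1) return false;
  // An approval only counts if it appears before that first connect call.
  return toolCalls
    .slice(0, firstConnectIndex)
    .some((call) => call.name === "AskUserQuestion" && call.result?.selectedId === "approve");
}

const connectFirst: OrderedToolCall[] = [
  { name: "Bash", args: { command: "npx novu connect --channel slack" } },
  { name: "AskUserQuestion", result: { selectedId: "approve" } },
];

console.log(confirmedBeforeRunSketch(connectFirst)); // false
console.log(confirmedBeforeRunSketch([...connectFirst].reverse())); // true
```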

**Verdict:** Real — execution. The playbook does not enforce (or the agent ignored) the confirm-before-run step.

**Fix target:** `packages/shared/docs/agent-onboarding.md` — strengthen the approval picker requirement before running connect.

**Do not:** Remove or weaken `catalog.confirmedBeforeRun`.
38 changes: 38 additions & 0 deletions .github/workflows/agent-evals.yml
@@ -0,0 +1,38 @@
name: Agent evals

on:
  pull_request:
    branches:
      - next
    paths:
      - packages/shared/docs/agent-onboarding.md
      - libs/agent-evals/**
      - .github/workflows/agent-evals.yml

jobs:
  evals:
    runs-on: ubuntu-latest
    timeout-minutes: 45
    steps:
      - name: Checkout
        uses: actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd # v5

      - name: Setup pnpm
        uses: pnpm/action-setup@0e279bb959325dab635dd2c09392533439d90093 # v6.0.8
        with:
          version: 11.0.9

      - name: Setup Node.js
        uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4
        with:
          node-version: 22
          cache: pnpm

      - name: Install dependencies
        run: pnpm install --frozen-lockfile

      - name: Run agent evals
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          NOVU_EVAL_JUDGE: 'false'
        run: pnpm --filter @novu/agent-evals eval src/suites/agent-onboarding
@@ -1,9 +1,9 @@
 import {
   type ActivationRuleParams,
   buildActivationOrConditions,
-  classifyActivationReason,
   ConversationActivationReasonEnum,
   type ConversationBillingState,
+  classifyActivationReason,
 } from '@novu/dal';
 import { expect } from 'chai';

@@ -46,10 +46,26 @@ describe('billing-activation-rules #novu-v2', () => {
     const cases: Array<{ name: string; billing: ConversationBillingState | undefined }> = [
       { name: 'no billing (brand new)', billing: undefined },
       { name: 'empty billing', billing: {} },
-      { name: 'counted this period, recent engagement', billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-20T00:00:00.000Z' } },
-      { name: 'counted this period, stale engagement', billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-01T00:00:00.000Z' } },
-      { name: 'counted a previous period', billing: { lastCountedPeriodKey: '2026-05', lastEngagementAt: '2026-06-20T00:00:00.000Z' } },
-      { name: 'resolved since last count', billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-20T00:00:00.000Z', resolvedAt: '2026-06-21T00:00:00.000Z' } },
+      {
+        name: 'counted this period, recent engagement',
+        billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-20T00:00:00.000Z' },
+      },
+      {
+        name: 'counted this period, stale engagement',
+        billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-01T00:00:00.000Z' },
+      },
+      {
+        name: 'counted a previous period',
+        billing: { lastCountedPeriodKey: '2026-05', lastEngagementAt: '2026-06-20T00:00:00.000Z' },
+      },
+      {
+        name: 'resolved since last count',
+        billing: {
+          lastCountedPeriodKey: PERIOD,
+          lastEngagementAt: '2026-06-20T00:00:00.000Z',
+          resolvedAt: '2026-06-21T00:00:00.000Z',
+        },
+      },
       { name: 'counted, no engagement timestamp', billing: { lastCountedPeriodKey: PERIOD } },
     ];

@@ -68,13 +84,20 @@ describe('billing-activation-rules #novu-v2', () => {
     // resolved wins over an otherwise-quiet, same-period conversation
     expect(
       classifyActivationReason(
-        { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-20T00:00:00.000Z', resolvedAt: '2026-06-21T00:00:00.000Z' },
+        {
+          lastCountedPeriodKey: PERIOD,
+          lastEngagementAt: '2026-06-20T00:00:00.000Z',
+          resolvedAt: '2026-06-21T00:00:00.000Z',
+        },
         params
       )
     ).to.equal(ConversationActivationReasonEnum.REOPEN);

     expect(
-      classifyActivationReason({ lastCountedPeriodKey: '2026-05', lastEngagementAt: '2026-06-20T00:00:00.000Z' }, params)
+      classifyActivationReason(
+        { lastCountedPeriodKey: '2026-05', lastEngagementAt: '2026-06-20T00:00:00.000Z' },
+        params
+      )
     ).to.equal(ConversationActivationReasonEnum.NEW_CYCLE);

     expect(
@@ -1,14 +1,19 @@
 import { Injectable } from '@nestjs/common';
 import { ModuleRef } from '@nestjs/core';
-import { AgentEntitlementsService, AnalyticsService, PinoLogger, throwPlanLimitExceeded } from '@novu/application-generic';
 import {
-  classifyActivationReason,
+  AgentEntitlementsService,
+  AnalyticsService,
+  PinoLogger,
+  throwPlanLimitExceeded,
+} from '@novu/application-generic';
+import {
   CommunityOrganizationRepository,
   ConversationActivationReasonEnum,
   ConversationActivationRepository,
   ConversationEntity,
   ConversationRepository,
   ConversationThreadKindEnum,
+  classifyActivationReason,
 } from '@novu/dal';
 import { ApiServiceLevelEnum, UNLIMITED_VALUE } from '@novu/shared';
 import {
@@ -417,7 +422,10 @@ export class ConversationActivationService {
       return;
     }

-    const currentCount = await this.activationRepository.countForOrganizationPeriod(context.organizationId, periodKey);
+    const currentCount = await this.activationRepository.countForOrganizationPeriod(
+      context.organizationId,
+      periodKey
+    );
     if (currentCount >= limit) {
       trackAgentActiveConversationLimitReached(this.analyticsService, {
         organizationId: context.organizationId,
@@ -7,10 +7,7 @@ import { AgentConfigResolver, ResolvedAgentConfig } from '../../channels/agent-c
 import type { ReplyContentDto } from '../../shared/dtos/agent-reply-payload.dto';
 import { AgentPlatformEnum } from '../../shared/enums/agent-platform.enum';
 import { esmImport } from '../../shared/util/esm-import';
-import {
-  buildPoweredByWatermark,
-  contentHasPoweredByWatermark,
-} from '../../shared/util/novu-powered-by-watermark';
+import { buildPoweredByWatermark, contentHasPoweredByWatermark } from '../../shared/util/novu-powered-by-watermark';
 import { type AgentActionTokenBinding, AgentActionTokenService } from '../action-token/agent-action-token.service';
 import { AgentConversationService } from '../conversation/agent-conversation.service';
 import { ChatInstanceRegistry } from '../ingress/chat-instance.registry';
@@ -32,12 +32,12 @@ import { type AutoProvisionPlatform, isAutoProvisionPlatform } from '../../share
 import { InboundAckService } from '../ack/inbound-ack.service';
 import { AgentAttachmentStorage, type StoredAttachment } from '../conversation/agent-attachment-storage.service';
 import { AgentConversationService, getInboundActivityPreview } from '../conversation/agent-conversation.service';
-import { ConversationActivationService } from '../conversation/conversation-activation.service';
 import {
   AgentSubscriberResolver,
   BotAuthorSkippedError,
   ConnectOrgSubscriberCapExceededError,
 } from '../conversation/agent-subscriber-resolver.service';
+import { ConversationActivationService } from '../conversation/conversation-activation.service';
 import { OutboundGateway } from '../egress/outbound.gateway';
 import type { BridgeReaction } from '../runtime/bridge-executor.service';
 import type { ConversationTurn } from '../runtime/conversation-turn';