novuhq · djabarovgeorge · Jun 22, 2026 · Jun 16, 2026 · Jun 16, 2026 · Jun 17, 2026
diff --git a/.cursor/skills/triage-agent-eval-failures/SKILL.md b/.cursor/skills/triage-agent-eval-failures/SKILL.md
@@ -0,0 +1,99 @@
+---
+name: triage-agent-eval-failures
+description: Triage failing @novu/agent-evals scenarios to decide whether a failure is real or flaky, and whether to fix the playbook/prompt or the test (grader, tape, scenario, or judge). Use when an agent-evals scenario fails, when the user asks why an eval is red, or when deciding whether to fix the test or the prompt.
+---
+
+# Triage Agent Eval Failures
+
+Diagnose a failing scenario in `libs/agent-evals` and produce a verdict: the failure is **real** (the playbook under test regressed), the **test** is wrong (grader / tape / scenario / judge), or the run is just **flaky** (model non-determinism).
+
+The thing under test is the playbook doc (`packages/shared/docs/agent-onboarding.md`), injected as the agent system prompt. Everything else (`graders.ts`, `catalog.ts`, `scenario.ts`, judge prompts) is test scaffolding. **Never fix the playbook to satisfy a broken grader, and never loosen a grader to hide a real playbook regression.**
+
+## Rule 0: rule out flakiness before changing anything
+
+Scenarios run a live model concurrently, so one red run is one sample, not a verdict. Re-run the single failing scenario 3–5× first:
+
+```bash
+pnpm --filter @novu/agent-evals exec vitest run --config vitest.evals.config.ts -t <scenario-id>
+```
+
+- Fails **every** run → deterministic failure, continue triage.
+- Fails **intermittently** → flaky. The cause is usually a non-deterministic judge grader or an over-strict regex. Do not edit the playbook. Sharpen the judge prompt, correct the over-strict regex, or accept variance; consider pass@k rather than single-run gating.
+
+To reproduce judge graders locally (PR/push CI runs deterministic graders only):
+
+```bash
+NOVU_EVAL_JUDGE=true pnpm --filter @novu/agent-evals exec vitest run --config vitest.evals.config.ts -t <scenario-id>
+```
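
The Rule 0 decision can be sketched as a small helper that turns 3–5 re-run outcomes into a label. This is only an illustration: `classifyReruns` and `RerunVerdict` are hypothetical names and are not part of `@novu/agent-evals`.

```typescript
// Hypothetical helper (not in the repo): labels a batch of re-runs per Rule 0.
type RerunVerdict = 'passing' | 'deterministic' | 'flaky';

function classifyReruns(passed: boolean[]): RerunVerdict {
  if (passed.length === 0) {
    throw new Error('need at least one re-run to classify');
  }

  const failures = passed.filter((ok) => !ok).length;

  if (failures === 0) {
    return 'passing'; // nothing to triage
  }

  // Fails every run -> deterministic failure; anything in between -> flaky.
  return failures === passed.length ? 'deterministic' : 'flaky';
}
```

A `deterministic` label only says the failure reproduces; Steps 1–3 still decide whether the playbook or the test is at fault.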
+
+## Step 1: identify which grader failed and its kind
+
+Each scenario registers graders in `scenarios/<id>/graders.ts`. The **kind** is the strongest triage signal:
+
+- **Deterministic** graders (`catalog.*`, `contains`, `matches`) inspect the structured `RunResult`. A fail means the agent's actions/output objectively did not match — or the check is too strict.
+- **Judge** graders (`sharedJudgeGraders`, `judge(...)`) call a second LLM pass. A fail is fuzzy and can be the judge prompt's fault, not the agent's.
+
+Find the grader's logic:
+
+| Layer | Location |
+| --- | --- |
+| Per-scenario grader wiring | `src/suites/agent-onboarding/scenarios/<id>/graders.ts` |
+| Deterministic grader bodies | `src/suites/agent-onboarding/catalog.ts` (`catalog` object) |
+| Judge prompts | `catalog.ts` (`judgePrompts`) + `sharedJudgeGraders` |
+| Generic helpers | `src/core/graders.ts` (`contains`, `matches`, `toolCallsNamed`, `transcriptText`) |
+| Judge mechanics | `src/core/judge.ts` (returns `skip` on `UNKNOWN`) |
+
+## Step 2: read the RunResult evidence
+
+Graders read fields off `RunResult` (`src/core/types.ts`). Map the failing grader to the field it checks and compare against what the agent actually did in the run output:
+
+- `trackedCommands` — raw connect command strings (flag checks like `--keyless`, `--secret-key`, `--slack-config-token`).
+- `toolCalls` — every `Bash` / `BashOutput` / `AskUserQuestion` / `Read` call with args (`run_in_background`, `file_path`, picker `selectedId`).
+- `polledShellIds` / `killedShellIds` — background-polling and kill behavior.
+- `capturedUrls` / `openedFiles` — surfaced URLs and opened files (e.g. QR `.png`, auth-url file).
+- `finalText` / `assistantMessages` — user-facing report (`transcriptText` joins these).
+- `metadata.description` — the drafted agent description (persona / infra-token graders).
+
+## Step 3: classify the failure
+
+Walk top-down and stop at the first match:
+
+| Symptom | Verdict | Fix target |
+| --- | --- | --- |
+| Agent never ran the tracked command / ignored an instruction it should follow | **Real — discovery** | Playbook `agent-onboarding.md` (instruction unclear/missing) |
+| Deterministic grader fails and the `RunResult` confirms the agent genuinely did the wrong thing | **Real — execution** | Playbook `agent-onboarding.md` |
+| Deterministic grader fails but `RunResult` shows the agent behaved correctly (regex too strict, wrong field, valid variant rejected) | **Test bug** | `catalog.ts` grader logic |
+| Fails only on the scripted CLI path; tape stdout/`when`/`validate` or scripted answers are wrong or stale | **Test bug** | `scenario.ts` (`tape`, `scriptedAnswers`), `connect-parser.ts` |
+| Judge grader fails but the description/report actually satisfies the criterion | **Test bug** | Judge prompt in `catalog.ts` (`judgePrompts`) |
+| Judge verdict flips run-to-run | **Flaky judge** | Sharpen judge prompt; rely on `UNKNOWN`→`skip` escape hatch |
+| Passes sometimes, fails sometimes, no clear cause | **Flaky** | Do not edit playbook; re-run (Rule 0) |
+
+A scenario passes only when every active grader averages ≥ `0.8` (`JUDGE_THRESHOLD`). A judge returning `UNKNOWN` becomes `skip` and scores `1` — it never causes a fail, so an `UNKNOWN` is not evidence of a real regression.
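
A minimal sketch of the pass rule above, assuming each grader yields a list of `pass` / `fail` / `skip` outcomes that are averaged. The type and function names here are hypothetical; the real shapes live in `src/core/types.ts` and may differ.

```typescript
// Assumed outcome shape for illustration only.
type GraderOutcome = 'pass' | 'fail' | 'skip';

const JUDGE_THRESHOLD = 0.8; // value quoted in the text above

// A skip (for example a judge UNKNOWN) scores 1, the same as a pass.
function outcomeScore(outcome: GraderOutcome): number {
  return outcome === 'fail' ? 0 : 1;
}

function graderAverage(outcomes: GraderOutcome[]): number {
  const total = outcomes.reduce((sum, o) => sum + outcomeScore(o), 0);
  return total / outcomes.length;
}

// Assumption: a grader with no recorded outcomes is inactive and is ignored.
function scenarioPasses(byGrader: Record<string, GraderOutcome[]>): boolean {
  return Object.values(byGrader)
    .filter((outcomes) => outcomes.length > 0)
    .every((outcomes) => graderAverage(outcomes) >= JUDGE_THRESHOLD);
}
```

Under these assumptions a run made up only of `skip` outcomes still passes, which is why an `UNKNOWN` carries no signal about the playbook.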
+
+## Step 4: apply one bounded fix, then verify
+
+1. Change **only** the layer the verdict points to — playbook **or** test, never both to chase green.
+2. Re-run the single scenario (Rule 0 command), with `NOVU_EVAL_JUDGE=true` if a judge grader was involved.
+3. Confirm the fix holds across the 3–5 re-runs and that no other scenario regressed.
+4. If editing a deterministic grader, also run the synthetic unit tests so you don't break grader contracts:
+
+```bash
+pnpm --filter @novu/agent-evals test
+```
+
+## Output format
+
+Report the verdict concisely with cited evidence:
+
+```
+Scenario: <id>
+Failing grader: <name> (deterministic | judge)
+Re-run result: <N/M failed> → real | flaky
+Evidence: <RunResult field + actual vs expected>
+Verdict: real playbook regression | test bug (<grader|tape|scenario|judge>) | flaky
+Fix target: <file path>  (or: no change — flaky/UNKNOWN)
+```
+
+## Additional resources
+
+For worked triage examples (real regression vs test bug vs flaky judge), see [reference.md](reference.md).
diff --git a/.cursor/skills/triage-agent-eval-failures/reference.md b/.cursor/skills/triage-agent-eval-failures/reference.md
@@ -0,0 +1,128 @@
+# Triage examples
+
+Worked examples for the `triage-agent-eval-failures` skill. Each walks through evidence → verdict → fix target.
+
+## Example 1: Real playbook regression — `usedDashboardOAuthWhenPrompted`
+
+**Scenario:** `dashboard-prompt-login`  
+**Failing grader:** `usedDashboardOAuthWhenPrompted` (deterministic)  
+**Re-run result:** 5/5 failed → real
+
+**Evidence:**
+
+```
+userPrompt: "I'm signed in to the Novu dashboard..."
+trackedCommands: ["npx novu connect --keyless --channel slack"]
+```
+
+The grader in `catalog.ts` checks: when `userPrompt` mentions "signed in to the Novu dashboard", every `trackedCommands` entry must omit `--keyless`. The agent ran connect with `--keyless` anyway.
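
To make the check concrete, here is a hedged sketch of what such a grader might look like. It is not the body of `catalog.usedDashboardOAuthWhenPrompted`; the input shape, the `skip` behavior for unrelated prompts, and the function name are assumptions.

```typescript
// Assumed minimal slice of RunResult, for illustration only.
interface DashboardCheckInput {
  userPrompt: string;
  trackedCommands: string[];
}

type CheckOutcome = 'pass' | 'fail' | 'skip';

function usedDashboardOAuthSketch(input: DashboardCheckInput): CheckOutcome {
  const mentionsDashboard = /signed in to the Novu dashboard/i.test(input.userPrompt);

  // Assumption: the check does not apply when the prompt never mentions the dashboard.
  if (!mentionsDashboard) {
    return 'skip';
  }

  // Every tracked command must omit the --keyless flag (matched as a whole token).
  const usedKeyless = input.trackedCommands.some((cmd) =>
    cmd.trim().split(/\s+/).includes('--keyless')
  );

  return usedKeyless ? 'fail' : 'pass';
}
```

Fed the evidence above, this sketch returns `fail`, which matches the real grader's result and supports the "agent did the wrong thing" reading.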
+
+**Verdict:** Real — execution. The playbook did not steer the agent toward dashboard OAuth when the user says they are signed in.
+
+**Fix target:** `packages/shared/docs/agent-onboarding.md` — clarify that dashboard-signed-in users must omit `--keyless`.
+
+**Do not:** Loosen the grader to accept `--keyless` when the prompt mentions the dashboard.
+
+---
+
+## Example 2: Test bug — `readAuthUrlFile` with correct behavior
+
+**Scenario:** `dashboard-prompt-login`  
+**Failing grader:** `readAuthUrlFile` (deterministic)  
+**Re-run result:** 5/5 failed → deterministic (but the test is wrong)
+
+**Evidence:**
+
+```
+toolCalls: [
+  { name: "Read", args: { file_path: "/project/novu-connect-auth-url.txt" } }
+]
+capturedUrls: ["https://auth.novu.test/oauth/device?code=abc"]
+transcriptText: "Open https://auth.novu.test/oauth/device?code=abc to authorize"
+```
+
+The grader checks for `novu-connect-auth-url` in the Read path, `/oauth/device` in `capturedUrls`, or `/oauth/device` in the transcript. All three are satisfied.
+
+**Verdict:** Test bug — grader. The failure reason may reference a path variant the check does not cover (e.g. relative vs absolute path in `file_path`). Inspect `catalog.readAuthUrlFile` for an overly narrow `includes('novu-connect-auth-url')` match.
+
+**Fix target:** `src/suites/agent-onboarding/catalog.ts` — widen the Read path check or normalize paths before comparing.
+
+**Do not:** Change the playbook; the agent already surfaced the auth URL correctly.
+
+---
+
+## Example 3: Flaky judge — `conclusionFirstReport`
+
+**Scenario:** `dashboard-prompt-login`  
+**Failing grader:** `conclusionFirstReport` (judge)  
+**Re-run result:** 2/5 failed → flaky
+
+**Evidence (passing run):**
+
+```
+finalText: "✓ Your agent is live. Open the dashboard to manage it: https://dashboard.novu.test/agents/dash-agent-1"
+```
+
+**Evidence (failing run, same agent output):**
+
+```
+finalText: "✓ Your agent is live. Open the dashboard to manage it: https://dashboard.novu.test/agents/dash-agent-1"
+judge rationale: "The message leads with a success statement but then adds setup context before the next action."
+```
+
+The deterministic graders all pass. The judge prompt asks whether the first line states the CLI result followed by the single next action. The agent output is identical; only the judge verdict flips.
+
+**Verdict:** Flaky judge. Non-deterministic LLM grading on a borderline structure.
+
+**Fix target:** Either sharpen `judgePrompts.conclusionFirstReport` in `catalog.ts` with explicit pass/fail examples, or accept variance and track pass@k. Do not edit the playbook for a 2/5 flake.
+
+**Note:** A judge returning `UNKNOWN` scores as `skip` (pass). An `UNKNOWN` is not a regression signal.
+
+---
+
+## Example 4: Test bug — stale tape chunk
+
+**Scenario:** `dashboard-prompt-login`  
+**Failing grader:** `reportedSuccess` (deterministic)  
+**Re-run result:** 5/5 failed → deterministic (but the tape is wrong)
+
+**Evidence:**
+
+```
+trackedCommands: ["npx novu connect --channel slack"]  // correct
+polledShellIds: ["shell-1"]  // correct
+transcriptText: "Waiting for connect to finish..."  // agent never saw success stdout
+```
+
+The agent polled the background shell but the final transcript never contains "agent is live". The tape in `scenario.ts` emits success stdout in the last chunk, but `connectTape` validation rejected the command before replay (e.g. `requireNoKeyless: true` but parser flags differ).
+
+**Verdict:** Test bug — tape/scenario. The fixture did not replay the expected CLI output; the agent behaved correctly given what it received.
+
+**Fix target:** `scenarios/dashboard-prompt-login/scenario.ts` — fix `tape` chunks or `connectTape` validation flags. Check `connect-parser.ts` if parsed flags do not match tape `when` conditions.
+
+**Do not:** Change the playbook to tell the agent to report success when the CLI gave no success signal.
+
+---
+
+## Example 5: Real playbook regression — `confirmedBeforeRun`
+
+**Scenario:** `persona-infra-exclusion`  
+**Failing grader:** `confirmedBeforeRun` (deterministic)  
+**Re-run result:** 5/5 failed → real
+
+**Evidence:**
+
+```
+toolCalls: [
+  { name: "Bash", args: { command: "npx novu connect ..." } },  // index 0
+  { name: "AskUserQuestion", result: { selectedId: "approve" } }  // index 2
+]
+```
+
+The grader requires an `AskUserQuestion` with `selectedId: "approve"` **before** the first connect `Bash` call. Connect ran first.
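
The ordering requirement can be sketched as an index comparison over `toolCalls`. This is an illustrative approximation rather than the body of `catalog.confirmedBeforeRun`: the `ToolCallSketch` shape follows the evidence snippet above, and matching connect calls by the substring `novu connect` is an assumption.

```typescript
// Assumed tool-call shape, mirroring the evidence snippet above.
interface ToolCallSketch {
  name: string;
  args?: { command?: string };
  result?: { selectedId?: string };
}

function confirmedBeforeRunSketch(toolCalls: ToolCallSketch[]): boolean {
  // Position of the first approval picker answered with "approve".
  const approvalIndex = toolCalls.findIndex(
    (call) => call.name === 'AskUserQuestion' && call.result?.selectedId === 'approve'
  );

  // Position of the first Bash call that runs connect (substring match is an assumption).
  const connectIndex = toolCalls.findIndex(
    (call) => call.name === 'Bash' && (call.args?.command ?? '').includes('novu connect')
  );

  if (approvalIndex === -1) {
    return false; // never confirmed
  }

  // Assumption: with no connect call at all, an approval alone is enough.
  return connectIndex === -1 || approvalIndex < connectIndex;
}
```

With the evidence order (connect first, approval later) the sketch returns `false`; swapping the two calls makes it return `true`.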
+
+**Verdict:** Real — execution. The playbook does not enforce (or the agent ignored) the confirm-before-run step.
+
+**Fix target:** `packages/shared/docs/agent-onboarding.md` — strengthen the approval picker requirement before running connect.
+
+**Do not:** Remove or weaken `catalog.confirmedBeforeRun`.
diff --git a/.github/workflows/agent-evals.yml b/.github/workflows/agent-evals.yml
@@ -0,0 +1,38 @@
+name: Agent evals
+
+on:
+  pull_request:
+    branches:
+      - next
+    paths:
+      - packages/shared/docs/agent-onboarding.md
+      - libs/agent-evals/**
+      - .github/workflows/agent-evals.yml
+
+jobs:
+  evals:
+    runs-on: ubuntu-latest
+    timeout-minutes: 45
+    steps:
+      - name: Checkout
+        uses: actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd # v5
+
+      - name: Setup pnpm
+        uses: pnpm/action-setup@0e279bb959325dab635dd2c09392533439d90093 # v6.0.8
+        with:
+          version: 11.0.9
+
+      - name: Setup Node.js
+        uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4
+        with:
+          node-version: 22
+          cache: pnpm
+
+      - name: Install dependencies
+        run: pnpm install --frozen-lockfile
+
+      - name: Run agent evals
+        env:
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+          NOVU_EVAL_JUDGE: 'false'
+        run: pnpm --filter @novu/agent-evals eval src/suites/agent-onboarding
diff --git a/apps/api/src/app/agents/conversation-runtime/conversation/billing-activation-rules.spec.ts b/apps/api/src/app/agents/conversation-runtime/conversation/billing-activation-rules.spec.ts
@@ -1,9 +1,9 @@
 import {
   type ActivationRuleParams,
   buildActivationOrConditions,
-  classifyActivationReason,
   ConversationActivationReasonEnum,
   type ConversationBillingState,
+  classifyActivationReason,
 } from '@novu/dal';
 import { expect } from 'chai';
 
@@ -46,10 +46,26 @@ describe('billing-activation-rules #novu-v2', () => {
   const cases: Array<{ name: string; billing: ConversationBillingState | undefined }> = [
     { name: 'no billing (brand new)', billing: undefined },
     { name: 'empty billing', billing: {} },
-    { name: 'counted this period, recent engagement', billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-20T00:00:00.000Z' } },
-    { name: 'counted this period, stale engagement', billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-01T00:00:00.000Z' } },
-    { name: 'counted a previous period', billing: { lastCountedPeriodKey: '2026-05', lastEngagementAt: '2026-06-20T00:00:00.000Z' } },
-    { name: 'resolved since last count', billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-20T00:00:00.000Z', resolvedAt: '2026-06-21T00:00:00.000Z' } },
+    {
+      name: 'counted this period, recent engagement',
+      billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-20T00:00:00.000Z' },
+    },
+    {
+      name: 'counted this period, stale engagement',
+      billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-01T00:00:00.000Z' },
+    },
+    {
+      name: 'counted a previous period',
+      billing: { lastCountedPeriodKey: '2026-05', lastEngagementAt: '2026-06-20T00:00:00.000Z' },
+    },
+    {
+      name: 'resolved since last count',
+      billing: {
+        lastCountedPeriodKey: PERIOD,
+        lastEngagementAt: '2026-06-20T00:00:00.000Z',
+        resolvedAt: '2026-06-21T00:00:00.000Z',
+      },
+    },
     { name: 'counted, no engagement timestamp', billing: { lastCountedPeriodKey: PERIOD } },
   ];
 
@@ -68,13 +84,20 @@ describe('billing-activation-rules #novu-v2', () => {
     // resolved wins over an otherwise-quiet, same-period conversation
     expect(
       classifyActivationReason(
-        { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-20T00:00:00.000Z', resolvedAt: '2026-06-21T00:00:00.000Z' },
+        {
+          lastCountedPeriodKey: PERIOD,
+          lastEngagementAt: '2026-06-20T00:00:00.000Z',
+          resolvedAt: '2026-06-21T00:00:00.000Z',
+        },
         params
       )
     ).to.equal(ConversationActivationReasonEnum.REOPEN);
 
     expect(
-      classifyActivationReason({ lastCountedPeriodKey: '2026-05', lastEngagementAt: '2026-06-20T00:00:00.000Z' }, params)
+      classifyActivationReason(
+        { lastCountedPeriodKey: '2026-05', lastEngagementAt: '2026-06-20T00:00:00.000Z' },
+        params
+      )
     ).to.equal(ConversationActivationReasonEnum.NEW_CYCLE);
 
     expect(

diff --git a/apps/api/src/app/agents/conversation-runtime/conversation/conversation-activation.service.ts b/apps/api/src/app/agents/conversation-runtime/conversation/conversation-activation.service.ts
@@ -1,14 +1,19 @@
 import { Injectable } from '@nestjs/common';
 import { ModuleRef } from '@nestjs/core';
-import { AgentEntitlementsService, AnalyticsService, PinoLogger, throwPlanLimitExceeded } from '@novu/application-generic';
 import {
-  classifyActivationReason,
+  AgentEntitlementsService,
+  AnalyticsService,
+  PinoLogger,
+  throwPlanLimitExceeded,
+} from '@novu/application-generic';
+import {
   CommunityOrganizationRepository,
   ConversationActivationReasonEnum,
   ConversationActivationRepository,
   ConversationEntity,
   ConversationRepository,
   ConversationThreadKindEnum,
+  classifyActivationReason,
 } from '@novu/dal';
 import { ApiServiceLevelEnum, UNLIMITED_VALUE } from '@novu/shared';
 import {
@@ -417,7 +422,10 @@ export class ConversationActivationService {
         return;
       }
 
-      const currentCount = await this.activationRepository.countForOrganizationPeriod(context.organizationId, periodKey);
+      const currentCount = await this.activationRepository.countForOrganizationPeriod(
+        context.organizationId,
+        periodKey
+      );
       if (currentCount >= limit) {
         trackAgentActiveConversationLimitReached(this.analyticsService, {
           organizationId: context.organizationId,

diff --git a/apps/api/src/app/agents/conversation-runtime/egress/outbound.gateway.ts b/apps/api/src/app/agents/conversation-runtime/egress/outbound.gateway.ts
@@ -7,10 +7,7 @@ import { AgentConfigResolver, ResolvedAgentConfig } from '../../channels/agent-c
 import type { ReplyContentDto } from '../../shared/dtos/agent-reply-payload.dto';
 import { AgentPlatformEnum } from '../../shared/enums/agent-platform.enum';
 import { esmImport } from '../../shared/util/esm-import';
-import {
-  buildPoweredByWatermark,
-  contentHasPoweredByWatermark,
-} from '../../shared/util/novu-powered-by-watermark';
+import { buildPoweredByWatermark, contentHasPoweredByWatermark } from '../../shared/util/novu-powered-by-watermark';
 import { type AgentActionTokenBinding, AgentActionTokenService } from '../action-token/agent-action-token.service';
 import { AgentConversationService } from '../conversation/agent-conversation.service';
 import { ChatInstanceRegistry } from '../ingress/chat-instance.registry';

diff --git a/apps/api/src/app/agents/conversation-runtime/ingress/inbound-turn.handler.ts b/apps/api/src/app/agents/conversation-runtime/ingress/inbound-turn.handler.ts
@@ -32,12 +32,12 @@ import { type AutoProvisionPlatform, isAutoProvisionPlatform } from '../../share
 import { InboundAckService } from '../ack/inbound-ack.service';
 import { AgentAttachmentStorage, type StoredAttachment } from '../conversation/agent-attachment-storage.service';
 import { AgentConversationService, getInboundActivityPreview } from '../conversation/agent-conversation.service';
-import { ConversationActivationService } from '../conversation/conversation-activation.service';
 import {
   AgentSubscriberResolver,
   BotAuthorSkippedError,
   ConnectOrgSubscriberCapExceededError,
 } from '../conversation/agent-subscriber-resolver.service';
+import { ConversationActivationService } from '../conversation/conversation-activation.service';
 import { OutboundGateway } from '../egress/outbound.gateway';
 import type { BridgeReaction } from '../runtime/bridge-executor.service';
 import type { ConversationTurn } from '../runtime/conversation-turn';