diff --git a/.cursor/skills/triage-agent-eval-failures/SKILL.md b/.cursor/skills/triage-agent-eval-failures/SKILL.md new file mode 100644 index 00000000000..cb6007a673e --- /dev/null +++ b/.cursor/skills/triage-agent-eval-failures/SKILL.md @@ -0,0 +1,99 @@ +--- +name: triage-agent-eval-failures +description: Triage failing @novu/agent-evals scenarios to decide whether a failure is real or flaky, and whether to fix the playbook/prompt or the test (grader, tape, scenario, or judge). Use when an agent-evals scenario fails, when the user asks why an eval is red, or when deciding whether to fix the test or the prompt. +--- + +# Triage Agent Eval Failures + +Diagnose a failing scenario in `libs/agent-evals` and produce a verdict: is the failure **real** (the playbook under test regressed) or is the **test** wrong (grader / tape / scenario / judge), or is it just **flaky** (model non-determinism)? + +The thing under test is the playbook doc (`packages/shared/docs/agent-onboarding.md`), injected as the agent system prompt. Everything else (`graders.ts`, `catalog.ts`, `scenario.ts`, judge prompts) is test scaffolding. **Never fix the playbook to satisfy a broken grader, and never loosen a grader to hide a real playbook regression.** + +## Rule 0: rule out flakiness before changing anything + +Scenarios run a live model concurrently, so one red run is one sample, not a verdict. Re-run the single failing scenario 3–5× first: + +```bash +pnpm --filter @novu/agent-evals exec vitest run --config vitest.evals.config.ts -t +``` + +- Fails **every** run → deterministic failure, continue triage. +- Fails **intermittently** → flaky. The cause is usually a non-deterministic judge grader or an over-strict regex. Do not edit the playbook. Tighten the grader/judge prompt or accept variance; consider pass@k rather than single-run gating. 
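The re-run rule above can be sketched as a small classifier. This is a minimal illustration, not code from `@novu/agent-evals`: the names `classifyReruns` and `passAtK` are hypothetical, and it assumes each re-run of the scenario has already been reduced to a single pass/fail boolean.

```typescript
// Hypothetical helpers for illustration only; not part of @novu/agent-evals.
type RerunVerdict = "deterministic-failure" | "flaky" | "passing";

// Classify 3-5 re-runs of one scenario, where each entry is true if that run passed.
function classifyReruns(passed: boolean[]): RerunVerdict {
  if (passed.length === 0) {
    throw new Error("need at least one run");
  }
  const failures = passed.filter((ok) => !ok).length;
  if (failures === passed.length) return "deterministic-failure"; // fails every run
  if (failures === 0) return "passing";
  return "flaky"; // fails intermittently
}

// pass@k gating: the scenario counts as green if at least one of the k runs passed.
function passAtK(passed: boolean[]): boolean {
  return passed.some((ok) => ok);
}
```

Under this sketch, only a `deterministic-failure` result would continue into the triage steps; a `flaky` result points at the grader or judge prompt rather than the playbook.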
+ +To reproduce judge graders locally: + +```bash +pnpm --filter @novu/agent-evals exec vitest run --config vitest.evals.config.ts -t +``` + +## Step 1: identify which grader failed and its kind + +Each scenario registers graders in `scenarios//graders.ts`. The **kind** is the strongest triage signal: + +- **Deterministic** graders (`catalog.*`, `contains`, `matches`) inspect the structured `RunResult`. A fail means the agent's actions/output objectively did not match — or the check is too strict. +- **Judge** graders (`sharedJudgeGraders`, `judge(...)`) call a second LLM pass. A fail is fuzzy and can be the judge prompt's fault, not the agent's. + +Find the grader's logic: + +| Layer | Location | +| --- | --- | +| Per-scenario grader wiring | `src/suites/agent-onboarding/scenarios//graders.ts` | +| Deterministic grader bodies | `src/suites/agent-onboarding/catalog.ts` (`catalog` object) | +| Judge prompts | `catalog.ts` (`judgePrompts`) + `sharedJudgeGraders` | +| Generic helpers | `src/core/graders.ts` (`contains`, `matches`, `toolCallsNamed`, `transcriptText`) | +| Judge mechanics | `src/core/judge.ts` (returns `skip` on `UNKNOWN`) | + +## Step 2: read the RunResult evidence + +Graders read fields off `RunResult` (`src/core/types.ts`). Map the failing grader to the field it checks and compare against what the agent actually did in the run output: + +- `trackedCommands` — raw connect command strings (flag checks like `--keyless`, `--secret-key`, `--slack-config-token`). +- `toolCalls` — every `Bash` / `BashOutput` / `AskUserQuestion` / `Read` call with args (`run_in_background`, `file_path`, picker `selectedId`). +- `polledShellIds` / `killedShellIds` — background-polling and kill behavior. +- `capturedUrls` / `openedFiles` — surfaced URLs and opened files (e.g. QR `.png`, auth-url file). +- `finalText` / `assistantMessages` — user-facing report (`transcriptText` joins these). 
+- `metadata.description` — the drafted agent description (persona / infra-token graders). + +## Step 3: classify the failure + +Walk top-down and stop at the first match: + +| Symptom | Verdict | Fix target | +| --- | --- | --- | +| Agent never ran the tracked command / ignored an instruction it should follow | **Real — discovery** | Playbook `agent-onboarding.md` (instruction unclear/missing) | +| Deterministic grader fails and the `RunResult` confirms the agent genuinely did the wrong thing | **Real — execution** | Playbook `agent-onboarding.md` | +| Deterministic grader fails but `RunResult` shows the agent behaved correctly (regex too strict, wrong field, valid variant rejected) | **Test bug** | `catalog.ts` grader logic | +| Fails only on the scripted CLI path; tape stdout/`when`/`validate` or scripted answers are wrong or stale | **Test bug** | `scenario.ts` (`tape`, `scriptedAnswers`), `connect-parser.ts` | +| Judge grader fails but the description/report actually satisfies the criterion | **Test bug** | Judge prompt in `catalog.ts` (`judgePrompts`) | +| Judge verdict flips run-to-run | **Flaky judge** | Sharpen judge prompt; rely on `UNKNOWN`→`skip` escape hatch | +| Passes sometimes, fails sometimes, no clear cause | **Flaky** | Do not edit playbook; re-run (Rule 0) | + +A scenario passes only when every active grader averages ≥ `0.8` (`JUDGE_THRESHOLD`). A judge returning `UNKNOWN` becomes `skip` and scores `1` — it never causes a fail, so an `UNKNOWN` is not evidence of a real regression. + +## Step 4: apply one bounded fix, then verify + +1. Change **only** the layer the verdict points to — playbook **or** test, never both to chase green. +2. Re-run the single scenario (Rule 0 command). +3. Confirm the fix holds across the 3–5 re-runs and that no other scenario regressed. +4. 
If editing a deterministic grader, also run the synthetic unit tests so you don't break grader contracts: + +```bash +pnpm --filter @novu/agent-evals test +``` + +## Output format + +Report the verdict concisely with cited evidence: + +``` +Scenario: +Failing grader: (deterministic | judge) +Re-run result: → real | flaky +Evidence: +Verdict: real playbook regression | test bug () | flaky +Fix target: (or: no change — flaky/UNKNOWN) +``` + +## Additional resources + +For worked triage examples (real regression vs test bug vs flaky judge), see [reference.md](reference.md). diff --git a/.cursor/skills/triage-agent-eval-failures/reference.md b/.cursor/skills/triage-agent-eval-failures/reference.md new file mode 100644 index 00000000000..fc317af47e2 --- /dev/null +++ b/.cursor/skills/triage-agent-eval-failures/reference.md @@ -0,0 +1,128 @@ +# Triage examples + +Worked examples for the `triage-agent-eval-failures` skill. Each walks through evidence → verdict → fix target. + +## Example 1: Real playbook regression — `usedDashboardOAuthWhenPrompted` + +**Scenario:** `dashboard-prompt-login` +**Failing grader:** `usedDashboardOAuthWhenPrompted` (deterministic) +**Re-run result:** 5/5 failed → real + +**Evidence:** + +``` +userPrompt: "I'm signed in to the Novu dashboard..." +trackedCommands: ["npx novu connect --keyless --channel slack"] +``` + +The grader in `catalog.ts` checks: when `userPrompt` mentions "signed in to the Novu dashboard", every `trackedCommands` entry must omit `--keyless`. The agent ran connect with `--keyless` anyway. + +**Verdict:** Real — execution. The playbook did not steer the agent toward dashboard OAuth when the user says they are signed in. + +**Fix target:** `packages/shared/docs/agent-onboarding.md` — clarify that dashboard-signed-in users must omit `--keyless`. + +**Do not:** Loosen the grader to accept `--keyless` when the prompt mentions the dashboard. 
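The check described in Example 1 can be sketched as follows. This is a hedged illustration, not the actual body of `catalog.usedDashboardOAuthWhenPrompted`: the `MinimalRun` shape is an assumed subset of `RunResult`, and the function name and matching regex are hypothetical.

```typescript
// Assumed minimal subset of RunResult; the real type in src/core/types.ts has more fields.
interface MinimalRun {
  userPrompt: string;
  trackedCommands: string[];
}

// Hypothetical sketch: when the prompt says the user is signed in to the dashboard,
// every tracked connect command must omit the --keyless flag.
function usedDashboardOAuthSketch(run: MinimalRun): boolean {
  const mentionsDashboard = /signed in to the Novu dashboard/i.test(run.userPrompt);
  if (!mentionsDashboard) {
    return true; // the check does not apply to this prompt
  }
  return run.trackedCommands.every(
    (command) => !command.split(/\s+/).includes("--keyless")
  );
}
```

Fed the evidence from Example 1, the sketch returns `false`, which matches the verdict that the agent (and therefore the playbook) is at fault rather than the grader.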
+ +--- + +## Example 2: Test bug — `readAuthUrlFile` with correct behavior + +**Scenario:** `dashboard-prompt-login` +**Failing grader:** `readAuthUrlFile` (deterministic) +**Re-run result:** 5/5 failed → real (but test is wrong) + +**Evidence:** + +``` +toolCalls: [ + { name: "Read", args: { file_path: "/project/novu-connect-auth-url.txt" } } +] +capturedUrls: ["https://auth.novu.test/oauth/device?code=abc"] +transcriptText: "Open https://auth.novu.test/oauth/device?code=abc to authorize" +``` + +The grader checks for `novu-connect-auth-url` in the Read path, `/oauth/device` in `capturedUrls`, or `/oauth/device` in the transcript. All three are satisfied. + +**Verdict:** Test bug — grader. The failure reason may reference a path variant the check does not cover (e.g. relative vs absolute path in `file_path`). Inspect `catalog.readAuthUrlFile` for an overly narrow `includes('novu-connect-auth-url')` match. + +**Fix target:** `src/suites/agent-onboarding/catalog.ts` — widen the Read path check or normalize paths before comparing. + +**Do not:** Change the playbook; the agent already surfaced the auth URL correctly. + +--- + +## Example 3: Flaky judge — `conclusionFirstReport` + +**Scenario:** `dashboard-prompt-login` +**Failing grader:** `conclusionFirstReport` (judge) +**Re-run result:** 2/5 failed → flaky + +**Evidence (passing run):** + +``` +finalText: "✓ Your agent is live. Open the dashboard to manage it: https://dashboard.novu.test/agents/dash-agent-1" +``` + +**Evidence (failing run, same agent output):** + +``` +finalText: "✓ Your agent is live. Open the dashboard to manage it: https://dashboard.novu.test/agents/dash-agent-1" +judge rationale: "The message leads with a success statement but then adds setup context before the next action." +``` + +The deterministic graders all pass. The judge prompt asks whether the first line states the CLI result followed by the single next action. The agent output is identical; only the judge verdict flips. 
+ +**Verdict:** Flaky judge. Non-deterministic LLM grading on a borderline structure. + +**Fix target:** Either sharpen `judgePrompts.conclusionFirstReport` in `catalog.ts` with explicit pass/fail examples, or accept variance and track pass@k. Do not edit the playbook for a 2/5 flake. + +**Note:** A judge returning `UNKNOWN` scores as `skip` (pass). An `UNKNOWN` is not a regression signal. + +--- + +## Example 4: Test bug — stale tape chunk + +**Scenario:** `dashboard-prompt-login` +**Failing grader:** `reportedSuccess` (deterministic) +**Re-run result:** 5/5 failed → real (but tape is wrong) + +**Evidence:** + +``` +trackedCommands: ["npx novu connect --channel slack"] // correct +polledShellIds: ["shell-1"] // correct +transcriptText: "Waiting for connect to finish..." // agent never saw success stdout +``` + +The agent polled the background shell but the final transcript never contains "agent is live". The tape in `scenario.ts` emits success stdout in the last chunk, but `connectTape` validation rejected the command before replay (e.g. `requireNoKeyless: true` but parser flags differ). + +**Verdict:** Test bug — tape/scenario. The fixture did not replay the expected CLI output; the agent behaved correctly given what it received. + +**Fix target:** `scenarios/dashboard-prompt-login/scenario.ts` — fix `tape` chunks or `connectTape` validation flags. Check `connect-parser.ts` if parsed flags do not match tape `when` conditions. + +**Do not:** Change the playbook to tell the agent to report success when the CLI gave no success signal. + +--- + +## Example 5: Real playbook regression — `confirmedBeforeRun` + +**Scenario:** `persona-infra-exclusion` +**Failing grader:** `confirmedBeforeRun` (deterministic) +**Re-run result:** 5/5 failed → real + +**Evidence:** + +``` +toolCalls: [ + { name: "Bash", args: { command: "npx novu connect ..." 
} }, // index 0 + { name: "AskUserQuestion", result: { selectedId: "approve" } } // index 2 +] +``` + +The grader requires an `AskUserQuestion` with `selectedId: "approve"` **before** the first connect `Bash` call. Connect ran first. + +**Verdict:** Real — execution. The playbook does not enforce (or the agent ignored) the confirm-before-run step. + +**Fix target:** `packages/shared/docs/agent-onboarding.md` — strengthen the approval picker requirement before running connect. + +**Do not:** Remove or weaken `catalog.confirmedBeforeRun`. diff --git a/.github/workflows/agent-evals.yml b/.github/workflows/agent-evals.yml new file mode 100644 index 00000000000..b319af2d6e5 --- /dev/null +++ b/.github/workflows/agent-evals.yml @@ -0,0 +1,37 @@ +name: Agent evals + +on: + pull_request: + branches: + - next + paths: + - packages/shared/docs/agent-onboarding.md + - libs/agent-evals/** + - .github/workflows/agent-evals.yml + +jobs: + evals: + runs-on: ubuntu-latest + timeout-minutes: 45 + steps: + - name: Checkout + uses: actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd # v5 + + - name: Setup pnpm + uses: pnpm/action-setup@0e279bb959325dab635dd2c09392533439d90093 # v6.0.8 + with: + version: 11.0.9 + + - name: Setup Node.js + uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4 + with: + node-version: 22 + cache: pnpm + + - name: Install dependencies + run: pnpm install --frozen-lockfile + + - name: Run agent evals + env: + ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} + run: pnpm --filter @novu/agent-evals eval src/suites/agent-onboarding diff --git a/apps/api/src/app/agents/conversation-runtime/conversation/billing-activation-rules.spec.ts b/apps/api/src/app/agents/conversation-runtime/conversation/billing-activation-rules.spec.ts index 6bcdaf84d68..d2eebb960ee 100644 --- a/apps/api/src/app/agents/conversation-runtime/conversation/billing-activation-rules.spec.ts +++ 
b/apps/api/src/app/agents/conversation-runtime/conversation/billing-activation-rules.spec.ts @@ -1,9 +1,9 @@ import { type ActivationRuleParams, buildActivationOrConditions, - classifyActivationReason, ConversationActivationReasonEnum, type ConversationBillingState, + classifyActivationReason, } from '@novu/dal'; import { expect } from 'chai'; @@ -46,10 +46,26 @@ describe('billing-activation-rules #novu-v2', () => { const cases: Array<{ name: string; billing: ConversationBillingState | undefined }> = [ { name: 'no billing (brand new)', billing: undefined }, { name: 'empty billing', billing: {} }, - { name: 'counted this period, recent engagement', billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-20T00:00:00.000Z' } }, - { name: 'counted this period, stale engagement', billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-01T00:00:00.000Z' } }, - { name: 'counted a previous period', billing: { lastCountedPeriodKey: '2026-05', lastEngagementAt: '2026-06-20T00:00:00.000Z' } }, - { name: 'resolved since last count', billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-20T00:00:00.000Z', resolvedAt: '2026-06-21T00:00:00.000Z' } }, + { + name: 'counted this period, recent engagement', + billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-20T00:00:00.000Z' }, + }, + { + name: 'counted this period, stale engagement', + billing: { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-01T00:00:00.000Z' }, + }, + { + name: 'counted a previous period', + billing: { lastCountedPeriodKey: '2026-05', lastEngagementAt: '2026-06-20T00:00:00.000Z' }, + }, + { + name: 'resolved since last count', + billing: { + lastCountedPeriodKey: PERIOD, + lastEngagementAt: '2026-06-20T00:00:00.000Z', + resolvedAt: '2026-06-21T00:00:00.000Z', + }, + }, { name: 'counted, no engagement timestamp', billing: { lastCountedPeriodKey: PERIOD } }, ]; @@ -68,13 +84,20 @@ describe('billing-activation-rules #novu-v2', () => { // resolved wins 
over an otherwise-quiet, same-period conversation expect( classifyActivationReason( - { lastCountedPeriodKey: PERIOD, lastEngagementAt: '2026-06-20T00:00:00.000Z', resolvedAt: '2026-06-21T00:00:00.000Z' }, + { + lastCountedPeriodKey: PERIOD, + lastEngagementAt: '2026-06-20T00:00:00.000Z', + resolvedAt: '2026-06-21T00:00:00.000Z', + }, params ) ).to.equal(ConversationActivationReasonEnum.REOPEN); expect( - classifyActivationReason({ lastCountedPeriodKey: '2026-05', lastEngagementAt: '2026-06-20T00:00:00.000Z' }, params) + classifyActivationReason( + { lastCountedPeriodKey: '2026-05', lastEngagementAt: '2026-06-20T00:00:00.000Z' }, + params + ) ).to.equal(ConversationActivationReasonEnum.NEW_CYCLE); expect( diff --git a/apps/api/src/app/agents/conversation-runtime/conversation/conversation-activation.service.ts b/apps/api/src/app/agents/conversation-runtime/conversation/conversation-activation.service.ts index e27c17bdb13..d6eaf3bf87d 100644 --- a/apps/api/src/app/agents/conversation-runtime/conversation/conversation-activation.service.ts +++ b/apps/api/src/app/agents/conversation-runtime/conversation/conversation-activation.service.ts @@ -1,14 +1,19 @@ import { Injectable } from '@nestjs/common'; import { ModuleRef } from '@nestjs/core'; -import { AgentEntitlementsService, AnalyticsService, PinoLogger, throwPlanLimitExceeded } from '@novu/application-generic'; import { - classifyActivationReason, + AgentEntitlementsService, + AnalyticsService, + PinoLogger, + throwPlanLimitExceeded, +} from '@novu/application-generic'; +import { CommunityOrganizationRepository, ConversationActivationReasonEnum, ConversationActivationRepository, ConversationEntity, ConversationRepository, ConversationThreadKindEnum, + classifyActivationReason, } from '@novu/dal'; import { ApiServiceLevelEnum, UNLIMITED_VALUE } from '@novu/shared'; import { @@ -417,7 +422,10 @@ export class ConversationActivationService { return; } - const currentCount = await 
this.activationRepository.countForOrganizationPeriod(context.organizationId, periodKey); + const currentCount = await this.activationRepository.countForOrganizationPeriod( + context.organizationId, + periodKey + ); if (currentCount >= limit) { trackAgentActiveConversationLimitReached(this.analyticsService, { organizationId: context.organizationId, diff --git a/apps/api/src/app/agents/conversation-runtime/egress/outbound.gateway.ts b/apps/api/src/app/agents/conversation-runtime/egress/outbound.gateway.ts index 38638bfc59f..c10c2dd1250 100644 --- a/apps/api/src/app/agents/conversation-runtime/egress/outbound.gateway.ts +++ b/apps/api/src/app/agents/conversation-runtime/egress/outbound.gateway.ts @@ -7,10 +7,7 @@ import { AgentConfigResolver, ResolvedAgentConfig } from '../../channels/agent-c import type { ReplyContentDto } from '../../shared/dtos/agent-reply-payload.dto'; import { AgentPlatformEnum } from '../../shared/enums/agent-platform.enum'; import { esmImport } from '../../shared/util/esm-import'; -import { - buildPoweredByWatermark, - contentHasPoweredByWatermark, -} from '../../shared/util/novu-powered-by-watermark'; +import { buildPoweredByWatermark, contentHasPoweredByWatermark } from '../../shared/util/novu-powered-by-watermark'; import { type AgentActionTokenBinding, AgentActionTokenService } from '../action-token/agent-action-token.service'; import { AgentConversationService } from '../conversation/agent-conversation.service'; import { ChatInstanceRegistry } from '../ingress/chat-instance.registry'; diff --git a/apps/api/src/app/agents/e2e/active-conversations.e2e.ts b/apps/api/src/app/agents/e2e/active-conversations.e2e.ts index 9222b77cd16..fa8844cb758 100644 --- a/apps/api/src/app/agents/e2e/active-conversations.e2e.ts +++ b/apps/api/src/app/agents/e2e/active-conversations.e2e.ts @@ -1,9 +1,5 @@ import { AgentEntitlementsService } from '@novu/application-generic'; -import { - CommunityOrganizationRepository, - ConversationActivationRepository, - 
ConversationStatusEnum, -} from '@novu/dal'; +import { CommunityOrganizationRepository, ConversationActivationRepository, ConversationStatusEnum } from '@novu/dal'; import { ApiServiceLevelEnum } from '@novu/shared'; import { testServer } from '@novu/testing'; import { expect } from 'chai'; @@ -92,10 +88,7 @@ describe('Active Conversations metering - inbound flow #novu-v2', () => { } async function setServiceLevel(apiServiceLevel: ApiServiceLevelEnum, isTrial = false): Promise { - await organizationRepository.update( - { _id: ctx.session.organization._id }, - { $set: { apiServiceLevel, isTrial } } - ); + await organizationRepository.update({ _id: ctx.session.organization._id }, { $set: { apiServiceLevel, isTrial } }); } async function invokeSlack(threadId: string, text: string, opts: { isDM?: boolean; userId?: string } = {}) { @@ -181,7 +174,11 @@ describe('Active Conversations metering - inbound flow #novu-v2', () => { // Rewind the last engagement past the 24h WhatsApp window. const stale = new Date(Date.now() - 25 * 60 * 60 * 1000).toISOString(); await conversationRepository.update( - { _id: conversation!._id, _environmentId: ctx.session.environment._id, _organizationId: ctx.session.organization._id }, + { + _id: conversation!._id, + _environmentId: ctx.session.environment._id, + _organizationId: ctx.session.organization._id, + }, { $set: { 'billing.lastEngagementAt': stale } } ); @@ -196,7 +193,11 @@ describe('Active Conversations metering - inbound flow #novu-v2', () => { // Only 2h elapsed — inside the 24h WhatsApp window. 
const recent = new Date(Date.now() - 2 * 60 * 60 * 1000).toISOString(); await conversationRepository.update( - { _id: conversation!._id, _environmentId: ctx.session.environment._id, _organizationId: ctx.session.organization._id }, + { + _id: conversation!._id, + _environmentId: ctx.session.environment._id, + _organizationId: ctx.session.organization._id, + }, { $set: { 'billing.lastEngagementAt': recent } } ); @@ -214,7 +215,11 @@ describe('Active Conversations metering - inbound flow #novu-v2', () => { const conversation = await findConversation(threadId); // Pretend it was last counted in a previous period. await conversationRepository.update( - { _id: conversation!._id, _environmentId: ctx.session.environment._id, _organizationId: ctx.session.organization._id }, + { + _id: conversation!._id, + _environmentId: ctx.session.environment._id, + _organizationId: ctx.session.organization._id, + }, { $set: { 'billing.lastCountedPeriodKey': '2000-01' } } ); @@ -333,7 +338,9 @@ describe('Active Conversations metering - inbound flow #novu-v2', () => { await invokeSlack(`T_STRIPE_${Date.now()}`, 'billed'); // Activation was recorded against the Stripe period key, not the calendar month. 
- expect(await activationRepository.countForOrganizationPeriod(ctx.session.organization._id, periodKey)).to.equal(1); + expect(await activationRepository.countForOrganizationPeriod(ctx.session.organization._id, periodKey)).to.equal( + 1 + ); const res = await ctx.session.testAgent.get('/v1/agents/usage/conversations'); expect(res.status).to.equal(200); diff --git a/apps/api/src/app/agents/management/agents.controller.ts b/apps/api/src/app/agents/management/agents.controller.ts index 7d3c0b0d8cc..7ec43f5202e 100644 --- a/apps/api/src/app/agents/management/agents.controller.ts +++ b/apps/api/src/app/agents/management/agents.controller.ts @@ -28,6 +28,7 @@ import { } from '../../shared/framework/response.decorator'; import { KeylessAccessible } from '../../shared/framework/swagger/keyless.security'; import { UserSession } from '../../shared/framework/user.decorator'; +import { ConversationActivationService } from '../conversation-runtime/conversation/conversation-activation.service'; import { AgentRuntimeExceptionFilter } from '../shared/agent-runtime-exception.filter'; import { AgentResponseDto, @@ -38,7 +39,6 @@ import { UpdateAgentBridgeRequestDto, UpdateAgentRequestDto, } from '../shared/dtos'; -import { ConversationActivationService } from '../conversation-runtime/conversation/conversation-activation.service'; import { type AgentEmojiEntry, ListAgentEmoji } from '../shared/emoji/list-agent-emoji/list-agent-emoji.usecase'; import { CreateAgentCommand } from './usecases/create-agent/create-agent.command'; import { CreateAgent } from './usecases/create-agent/create-agent.usecase'; diff --git a/apps/api/src/app/agents/mcp/connections/mcp-connect-redirect.service.spec.ts b/apps/api/src/app/agents/mcp/connections/mcp-connect-redirect.service.spec.ts index 815cfb64e81..4002048ffc2 100644 --- a/apps/api/src/app/agents/mcp/connections/mcp-connect-redirect.service.spec.ts +++ b/apps/api/src/app/agents/mcp/connections/mcp-connect-redirect.service.spec.ts @@ -121,8 +121,6 @@ 
describe('McpConnectRedirectService', () => { it('buildMcpConnectRedirectUrl encodes the token in the public path', () => { process.env.API_ROOT_URL = 'https://api.example.com/'; - expect(buildMcpConnectRedirectUrl('abc123')).to.equal( - `https://api.example.com${MCP_CONNECT_REDIRECT_PATH}/abc123` - ); + expect(buildMcpConnectRedirectUrl('abc123')).to.equal(`https://api.example.com${MCP_CONNECT_REDIRECT_PATH}/abc123`); }); }); diff --git a/apps/api/src/app/agents/mcp/oauth/agents-mcp-oauth.controller.spec.ts b/apps/api/src/app/agents/mcp/oauth/agents-mcp-oauth.controller.spec.ts index 4499c5bc3fc..4ddc0804fb0 100644 --- a/apps/api/src/app/agents/mcp/oauth/agents-mcp-oauth.controller.spec.ts +++ b/apps/api/src/app/agents/mcp/oauth/agents-mcp-oauth.controller.spec.ts @@ -34,8 +34,9 @@ describe('AgentsMcpOAuthController connect redirect', () => { await controller.getConnectRedirect(res as any, 'short-token'); expect(mcpConnectRedirect.resolve.calledOnceWithExactly('short-token')).to.equal(true); - expect(res.redirect.calledOnceWithExactly(HttpStatus.FOUND, 'https://provider.example/oauth/authorize?state=abc')) - .to.equal(true); + expect( + res.redirect.calledOnceWithExactly(HttpStatus.FOUND, 'https://provider.example/oauth/authorize?state=abc') + ).to.equal(true); expect(res.send.called).to.equal(false); }); diff --git a/apps/api/src/app/agents/mcp/oauth/agents-mcp-oauth.controller.ts b/apps/api/src/app/agents/mcp/oauth/agents-mcp-oauth.controller.ts index a0292bbb2de..d0bce9ef348 100644 --- a/apps/api/src/app/agents/mcp/oauth/agents-mcp-oauth.controller.ts +++ b/apps/api/src/app/agents/mcp/oauth/agents-mcp-oauth.controller.ts @@ -6,8 +6,8 @@ import { Response } from 'express'; import { ThrottlerCategory } from '../../../rate-limiting/guards'; import { renderConnectionResultPage } from '../../../shared/html/connection-result-page'; import { CompleteProviderManagedRedirect } from 
'../connections/ensure-provider-managed-vault/complete-provider-managed-redirect.usecase'; -import { McpConnectRedirectService } from '../connections/mcp-connect-redirect.service'; import { PROVIDER_MANAGED_REDIRECT_PATH } from '../connections/ensure-provider-managed-vault/provider-managed-redirect-state'; +import { McpConnectRedirectService } from '../connections/mcp-connect-redirect.service'; import { McpOAuthCallbackCommand } from './mcp-oauth-callback/mcp-oauth-callback.command'; import { McpOAuthCallback } from './mcp-oauth-callback/mcp-oauth-callback.usecase'; import { renderExpiredMcpSetupLinkPage, sendMcpOAuthResultPage } from './mcp-oauth-result-page.util'; @@ -113,8 +113,8 @@ export class AgentsMcpOAuthController { err.message.includes('redirect state expired'); const isNotFound = err instanceof NotFoundException; - let title = 'Connection failed'; - let heading = "We couldn't connect"; + const title = 'Connection failed'; + const heading = "We couldn't connect"; let message = 'Something went wrong while opening the provider connection. 
Send a new message to your agent and try again.'; diff --git a/apps/dashboard/src/api/agents.ts b/apps/dashboard/src/api/agents.ts index 34f967b7be4..8791a3eecd0 100644 --- a/apps/dashboard/src/api/agents.ts +++ b/apps/dashboard/src/api/agents.ts @@ -250,7 +250,10 @@ export type ConversationUsage = { periodEnd: string; }; -export async function getConversationUsage(environment: IEnvironment, signal?: AbortSignal): Promise { +export async function getConversationUsage( + environment: IEnvironment, + signal?: AbortSignal +): Promise { const response = await get<{ data: ConversationUsage } | ConversationUsage>('/agents/usage/conversations', { environment, signal, diff --git a/libs/agent-evals/.env.example b/libs/agent-evals/.env.example new file mode 100644 index 00000000000..3a8cf17359c --- /dev/null +++ b/libs/agent-evals/.env.example @@ -0,0 +1,5 @@ +ANTHROPIC_API_KEY= + +# Optional model overrides (default: claude-sonnet-4-5) +NOVU_EVAL_MODEL= +NOVU_EVAL_JUDGE_MODEL= diff --git a/libs/agent-evals/.gitignore b/libs/agent-evals/.gitignore new file mode 100644 index 00000000000..5bf39403e34 --- /dev/null +++ b/libs/agent-evals/.gitignore @@ -0,0 +1,4 @@ +debug-runs/ +scores-*.json +.vitest-evals/ +.env diff --git a/libs/agent-evals/README.md b/libs/agent-evals/README.md new file mode 100644 index 00000000000..ea09f6cec35 --- /dev/null +++ b/libs/agent-evals/README.md @@ -0,0 +1,188 @@ +# @novu/agent-evals + +Behavioral eval harness for Novu coding-agent playbooks. Runs a real LLM agent against scripted scenarios with a mocked CLI, then grades whether the agent follows the playbook using deterministic structural checks plus optional LLM-as-judge graders for fuzzy criteria. + +The harness is **suite-based**: `src/core/` holds the playbook-agnostic simulation layer (mock tools, tape replay, recorder), and each suite under `src/suites/` plugs in its system prompt, command parser, scenarios, and grader catalog. 
Scoring and reporting are handled by [vitest-evals](https://vitest-evals.sentry.dev/). + +The first suite, `agent-onboarding`, tests `@novu/shared/docs/agent-onboarding.md` (the `npx novu connect` flow), resolved via the `@novu/shared` package export. + +## Architecture + +### Layer overview + +```mermaid +flowchart TB + subgraph entry["Entry (vitest)"] + Eval["onboarding.eval.ts\ndescribeEval per scenario"] + Adapters["adapters.ts\ngrader → judge"] + end + + subgraph core["Core simulation (src/core/)"] + Harness["harness.ts\ncreateHarness + AI SDK loop"] + Tools["tools.ts\nBash · BashOutput · AskUserQuestion · Read"] + MockShell["mock-shell.ts\nTape replay engine"] + Recorder["recorder.ts\nRunResult builder"] + Graders["graders.ts\ndefineGraders · contains · judge"] + Judge["judge.ts\nLLM-as-judge"] + end + + subgraph suite["Suite (src/suites/agent-onboarding/)"] + SuiteObj["index.ts\nSuite contract"] + Scenarios["scenarios/{id}/\nscenario.ts · graders.ts · project/"] + Parser["connect-parser.ts"] + Tape["tape.ts"] + Catalog["catalog.ts"] + end + + Eval --> Harness + Eval --> Adapters + Adapters --> Graders + Adapters --> Judge + Harness --> Tools + Tools --> MockShell + Tools --> Recorder + Harness --> Recorder + SuiteObj --> Harness + Parser --> MockShell + Tape --> MockShell + Scenarios --> Eval + Catalog --> Scenarios +``` + +### Execution flow + +Each scenario is a vitest-evals `describeEval` block: one harness run, then automatic judges score the resulting `RunResult`. 
+ +```mermaid +sequenceDiagram + participant Vitest as vitest-evals + participant Harness as harness.ts + participant LLM as Anthropic model + participant Tools as Harness tools + participant Shell as MockShellEngine + participant Rec as RunRecorder + participant Judges as adapters.ts + + Vitest->>Harness: run(userPrompt) + Harness->>Harness: resolveSystemPrompt(playbook doc) + Harness->>LLM: generateText(system + user prompt, tools) + loop tool-calling steps + LLM->>Tools: Bash / BashOutput / AskUserQuestion / Read + alt tracked command (e.g. novu connect) + Tools->>Shell: createShell → replay tape chunks + Shell-->>Tools: scripted stdout + Tools->>Rec: record tracked command, URLs, polls + else AskUserQuestion + Tools->>Rec: pick scriptedAnswers[answerIndex] + else Read fixture + Tools->>Rec: read scenario project/ files + end + Tools-->>LLM: tool result + end + opt followUpMessages / followUpOnOptionId + Harness->>LLM: inject scripted user follow-up + end + Harness->>Rec: build() → RunResult + Harness-->>Vitest: HarnessRun with output + Vitest->>Judges: assess each grader as judge (threshold 0.8) + alt judge grader + Judges->>LLM: runJudge(prompt, context) + end + Judges-->>Vitest: pass / fail per judge +``` + +### Key concepts + +| Concept | Role | +| --- | --- | +| **Suite** | Plugs a playbook (system prompt), `CommandParser`, scenario list, and optional hooks into the harness. | +| **Scenario** | One eval case: user prompt, fixture `project/`, scripted user answers, optional CLI **tape**, and follow-up messages. | +| **Tape** | Ordered stdout chunks replayed when the agent runs a tracked command; `when(parsed)` can branch on parsed flags. | +| **CommandParser** | Decides which shell commands are tracked (e.g. `novu connect`) and parses them for tape selection and validation. | +| **RunResult** | Everything the agent did: tool calls, assistant text, captured URLs, polled/killed shells, suite metadata. 
| +| **Graders / judges** | **Deterministic** checks on `RunResult`, or **judge** graders that call a second LLM pass. Adapted to vitest-evals `createJudge` via `adapters.ts`. | + +## Structure + +```text +src/ + core/ # suite-agnostic simulation + types.ts # Suite contract, RunResult, Tape, CommandParser + tools.ts # Bash / BashOutput / AskUserQuestion / Read + mock-shell.ts # tape replay engine + recorder.ts # RunResult builder + graders.ts # defineGraders, contains, matches, judge + judge.ts # LLM-as-judge (Anthropic via AI SDK) + suites/ + agent-onboarding/ + index.ts # the Suite object + harness.ts # createHarness + multi-turn agent loop + adapters.ts # grader → vitest-evals judge + onboarding.eval.ts # describeEval per scenario + connect-parser.ts + tape.ts + catalog.ts + graders.test.ts # synthetic RunResult unit tests + scenarios// # scenario.ts + graders.ts + project/ fixtures +vitest.config.ts # unit tests (*.test.ts) +vitest.evals.config.ts # evals (*.eval.ts) + vitest-evals reporter +``` + +## Setup + +```bash +cp .env.example .env # from libs/agent-evals/ +pnpm install +``` + +Set `ANTHROPIC_API_KEY` in `.env` before running evals. Eval suites skip automatically when the key is missing. 
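As a rough illustration of the skip-when-unset behavior, the sketch below gates a grader on the presence of `ANTHROPIC_API_KEY`. It mirrors the early return in `runJudge` (`src/core/judge.ts`), but `gateOnApiKey` and `EnvSketch` are hypothetical names, and the environment is passed in as a plain object so the example stays self-contained.

```typescript
type GraderResult = 'pass' | 'fail' | 'skip';
// Hypothetical stand-in for process.env so the sketch has no Node typings dependency.
type EnvSketch = { [key: string]: string | undefined };

// Abstain with 'skip' when the key is missing or empty; otherwise run the grader.
function gateOnApiKey(env: EnvSketch, run: () => GraderResult): GraderResult {
  if (!env.ANTHROPIC_API_KEY) {
    return 'skip';
  }

  return run();
}

// Usage: the same grader skips without a key and runs with one.
const withoutKey = gateOnApiKey({}, () => 'pass');
const withKey = gateOnApiKey({ ANTHROPIC_API_KEY: 'sk-example' }, () => 'pass');
```

Returning `skip` rather than `fail` keeps a missing key from looking like a playbook regression, which is why a local run without credentials is not reported as red.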
+ +## Local commands + +**Unit tests** (no API key — synthetic `RunResult` grader checks): + +```bash +pnpm --filter @novu/agent-evals test +``` + +**Evals** (requires `ANTHROPIC_API_KEY`): + +```bash +pnpm --filter @novu/agent-evals eval +pnpm --filter @novu/agent-evals eval:watch + +# Single scenario +pnpm --filter @novu/agent-evals exec vitest run --config vitest.evals.config.ts -t keyless-slack-secure +``` + +## Environment variables + +| Variable | Description | +| --- | --- | +| `ANTHROPIC_API_KEY` | Required for eval runs (suites skip when unset) | +| `NOVU_EVAL_MODEL` | Agent model (default: `claude-sonnet-4-5`) | +| `NOVU_EVAL_JUDGE_MODEL` | Judge model (default: `claude-sonnet-4-5`) | +| `NOVU_EVAL_CONCURRENCY` | Max scenarios run in parallel (default: `4`) | +| `NOVU_EVAL_MAX_STEPS` | Max agent steps per scenario run (default: `40`) | + +Scenarios are independent and dominated by live-model latency, so they run concurrently (`sequence.concurrent`). Raise `NOVU_EVAL_CONCURRENCY` for faster runs or lower it if you hit Anthropic rate limits. + +## Threshold semantics + +Each scenario uses `judgeThreshold: 0.8` — the average judge score for that scenario must be ≥ 80%. This is stricter than the old global `--fail-under 80` (which gated on the average across all scenarios): every scenario must pass individually. + +Judge graders (LLM-as-judge) always run alongside deterministic graders. + +## Triage failing scenarios + +When a scenario fails, use the Cursor skill `triage-agent-eval-failures` (`.cursor/skills/triage-agent-eval-failures/`) to decide whether the failure is real (playbook regression), a test bug (grader / tape / judge), or flaky (model non-determinism). The skill walks through re-run checks, `RunResult` evidence, and a fix target — playbook vs test scaffolding. Worked examples are in `reference.md` inside that skill directory. + +## Adding a new suite + +1. 
Create `src/suites//` with a `CommandParser`, scenario folders, grader catalog, and `harness.ts`. +2. Export a `Suite` object from `index.ts`. +3. Add `.eval.ts` that loops scenarios and registers `describeEval` blocks. + +## CI + +GitHub Actions workflow `.github/workflows/agent-evals.yml` runs `pnpm --filter @novu/agent-evals eval` on PRs to `next` that touch the playbook or harness. diff --git a/libs/agent-evals/package.json b/libs/agent-evals/package.json new file mode 100644 index 00000000000..2091230f935 --- /dev/null +++ b/libs/agent-evals/package.json @@ -0,0 +1,27 @@ +{ + "name": "@novu/agent-evals", + "version": "0.1.0", + "private": true, + "description": "Behavioral eval harness for Novu coding-agent playbooks (suite-based).", + "type": "module", + "scripts": { + "eval": "vitest run --config vitest.evals.config.ts", + "eval:watch": "vitest --config vitest.evals.config.ts", + "test": "vitest run --config vitest.config.ts", + "check": "biome check .", + "check:fix": "biome check --write ." 
+ }, + "dependencies": { + "@novu/shared": "workspace:*", + "@ai-sdk/anthropic": "^3.0.10", + "ai": "6.0.50", + "dotenv": "^16.6.1", + "zod": "^3.23.8" + }, + "devDependencies": { + "@types/node": "^22.0.0", + "typescript": "5.6.2", + "vitest": "^4.1.8", + "vitest-evals": "0.12.0" + } +} diff --git a/libs/agent-evals/project.json b/libs/agent-evals/project.json new file mode 100644 index 00000000000..77bcd43f91c --- /dev/null +++ b/libs/agent-evals/project.json @@ -0,0 +1,25 @@ +{ + "name": "@novu/agent-evals", + "sourceRoot": "libs/agent-evals/src", + "projectType": "library", + "targets": { + "lint": { + "executor": "nx:run-commands", + "options": { + "command": "npx biome lint libs/agent-evals" + } + }, + "eval": { + "executor": "nx:run-commands", + "options": { + "command": "pnpm --filter @novu/agent-evals eval" + } + }, + "test": { + "executor": "nx:run-commands", + "options": { + "command": "pnpm --filter @novu/agent-evals test" + } + } + } +} diff --git a/libs/agent-evals/scripts/run-evals.sh b/libs/agent-evals/scripts/run-evals.sh new file mode 100755 index 00000000000..2cee1ce33a9 --- /dev/null +++ b/libs/agent-evals/scripts/run-evals.sh @@ -0,0 +1,7 @@ +#!/usr/bin/env bash +set -euo pipefail + +ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." 
&& pwd)" +cd "$ROOT_DIR" + +pnpm eval "$@" diff --git a/libs/agent-evals/src/core/graders.test.ts b/libs/agent-evals/src/core/graders.test.ts new file mode 100644 index 00000000000..0783186d8e3 --- /dev/null +++ b/libs/agent-evals/src/core/graders.test.ts @@ -0,0 +1,25 @@ +import { describe, expect, it } from 'vitest'; +import { transcriptText } from './graders.js'; +import { RunRecorder } from './recorder.js'; + +describe('transcriptText', () => { + it('does not duplicate the final assistant turn', () => { + const recorder = new RunRecorder('s', 'prompt'); + recorder.recordAssistantMessage('first turn'); + recorder.recordAssistantMessage('final turn'); + + const result = recorder.build(); + + // finalText mirrors the last assistant message, so the transcript must contain + // "final turn" exactly once. + expect(result.finalText).toBe('final turn'); + expect(transcriptText(result)).toBe('first turn\nfinal turn'); + expect(transcriptText(result).match(/final turn/g)).toHaveLength(1); + }); + + it('is empty when no assistant messages were recorded', () => { + const result = new RunRecorder('s', 'prompt').build(); + + expect(transcriptText(result)).toBe(''); + }); +}); diff --git a/libs/agent-evals/src/core/graders.ts b/libs/agent-evals/src/core/graders.ts new file mode 100644 index 00000000000..607e08172d8 --- /dev/null +++ b/libs/agent-evals/src/core/graders.ts @@ -0,0 +1,75 @@ +import { runJudge } from './judge.js'; +import type { GraderDefinition, GraderFn, GraderOutcome, RunResult, ToolCallRecord } from './types.js'; + +/** Helper for graders that want to explain a failure inline. 
*/ +export function fail(reason: string): GraderOutcome { + return { status: 'fail', reason }; +} + +export function labeled(label: string, input: GraderFn | GraderDefinition): GraderDefinition { + if (typeof input === 'function') { + return { kind: 'deterministic', run: input, label }; + } + + return { ...input, label }; +} + +export function defineGraders>( + graders: T +): Record { + const normalized = {} as Record; + + for (const [name, value] of Object.entries(graders) as Array<[keyof T, GraderFn | GraderDefinition]>) { + if (typeof value === 'function') { + normalized[name] = { kind: 'deterministic', run: value }; + } else { + normalized[name] = value; + } + } + + return normalized; +} + +export function contains(substring: string, source: (result: RunResult) => string): GraderFn { + return (result) => (source(result).toLowerCase().includes(substring.toLowerCase()) ? 'pass' : 'fail'); +} + +export function notContains(substring: string, source: (result: RunResult) => string): GraderFn { + return (result) => (!source(result).toLowerCase().includes(substring.toLowerCase()) ? 'pass' : 'fail'); +} + +export function containsAny(substrings: string[], source: (result: RunResult) => string): GraderFn { + return (result) => { + const haystack = source(result).toLowerCase(); + + return substrings.some((item) => haystack.includes(item.toLowerCase())) ? 'pass' : 'fail'; + }; +} + +export function matches(pattern: RegExp, source: (result: RunResult) => string): GraderFn { + return (result) => (pattern.test(source(result)) ? 'pass' : 'fail'); +} + +export function toolCallsNamed(result: RunResult, name: string): ToolCallRecord[] { + return result.toolCalls.filter((call) => call.name === name); +} + +export function transcriptText(result: RunResult): string { + // The recorder mirrors the last assistant turn into `finalText`, so appending it again + // would duplicate that turn in judge prompts and regex-match contexts. 
Only include + // `finalText` when it is not already the last recorded message. + const messages = [...result.assistantMessages]; + + if (result.finalText && messages[messages.length - 1] !== result.finalText) { + messages.push(result.finalText); + } + + return messages.join('\n'); +} + +export function judge(prompt: string, context: (result: RunResult) => string): GraderDefinition { + return { + kind: 'judge', + run: async (result) => runJudge(prompt, context(result)), + }; +} diff --git a/libs/agent-evals/src/core/judge.ts b/libs/agent-evals/src/core/judge.ts new file mode 100644 index 00000000000..ea463340e05 --- /dev/null +++ b/libs/agent-evals/src/core/judge.ts @@ -0,0 +1,47 @@ +import { anthropic } from '@ai-sdk/anthropic'; +import { generateText } from 'ai'; +import type { GraderOutcome, GraderResult } from './types.js'; + +const DEFAULT_JUDGE_MODEL = 'claude-sonnet-4-5'; + +export async function runJudge(prompt: string, context: string, options?: { model?: string }): Promise { + if (!process.env.ANTHROPIC_API_KEY) { + return { status: 'skip' }; + } + + const model = options?.model ?? process.env.NOVU_EVAL_JUDGE_MODEL ?? DEFAULT_JUDGE_MODEL; + + const result = await generateText({ + model: anthropic(model), + prompt: [ + 'You are grading an AI agent run against a coding-agent playbook.', + 'First, write one sentence of reasoning explaining your verdict.', + 'Then, on the final line, answer with exactly YES, NO, or UNKNOWN.', + 'Answer UNKNOWN only if the context does not contain enough information to judge the question.', + '', + `Question: ${prompt}`, + '', + 'Context:', + context, + ].join('\n'), + }); + + const lines = result.text + .trim() + .split('\n') + .map((line) => line.trim()) + .filter((line) => line.length > 0); + + const verdictLine = lines.at(-1) ?? ''; + const verdict = verdictLine.toUpperCase(); + const reason = lines.slice(0, -1).join(' ').trim() || undefined; + + // Escape hatch: a starved judge abstains instead of counting as a failure. 
+ if (verdict.startsWith('UNKNOWN')) { + return { status: 'skip' }; + } + + const status: GraderResult = verdict.startsWith('YES') ? 'pass' : 'fail'; + + return status === 'fail' ? { status, reason } : { status }; +} diff --git a/libs/agent-evals/src/core/mock-shell.test.ts b/libs/agent-evals/src/core/mock-shell.test.ts new file mode 100644 index 00000000000..09467833c41 --- /dev/null +++ b/libs/agent-evals/src/core/mock-shell.test.ts @@ -0,0 +1,57 @@ +import { describe, expect, it } from 'vitest'; +import { MockShellEngine } from './mock-shell.js'; +import type { CommandParser, EvalScenario } from './types.js'; + +type Flags = { token?: string }; + +const parser: CommandParser = { + matches: (command) => /\bconnect\b/.test(command), + parse: (command) => ({ token: /--slack-config-token\b/.test(command) ? 'xoxe' : undefined }), +}; + +function scenario(): EvalScenario { + return { + id: 'pending-shell', + category: 'test', + description: '', + userPrompt: '', + projectRoot: '/tmp', + scriptedAnswers: [], + tape: { + chunks: [{ stdout: 'NOVU_CONNECT_SLACK_SETUP_URL=https://setup.test' }], + exitCode: 0, + pendingWhen: (flags) => !flags.token, + }, + }; +} + +describe('MockShellEngine pendingWhen', () => { + it('keeps a pending (no-token) shell running until it is killed', () => { + const engine = new MockShellEngine(scenario(), parser); + const shell = engine.createShell('novu connect', true, {}); + + // Drain every chunk; a pending branch must not auto-complete. 
+ engine.pollShell(shell.id); + engine.pollShell(shell.id); + engine.pollShell(shell.id); + + expect(shell.exitCode).toBeNull(); + expect(shell.completed).toBe(false); + + engine.killShell(shell.id); + + expect(shell.completed).toBe(true); + expect(shell.killed).toBe(true); + }); + + it('completes a non-pending (token) shell after its chunks are emitted', () => { + const engine = new MockShellEngine(scenario(), parser); + const shell = engine.createShell('novu connect --slack-config-token xoxe', true, {}); + + engine.pollShell(shell.id); + engine.pollShell(shell.id); + + expect(shell.exitCode).toBe(0); + expect(shell.completed).toBe(true); + }); +}); diff --git a/libs/agent-evals/src/core/mock-shell.ts b/libs/agent-evals/src/core/mock-shell.ts new file mode 100644 index 00000000000..55e264065ae --- /dev/null +++ b/libs/agent-evals/src/core/mock-shell.ts @@ -0,0 +1,136 @@ +import type { CommandParser, EvalScenario, MockShellState, ParsedCommand, Tape } from './types.js'; + +function selectTapeChunks(tape: Tape, parsed: TParsed): string[] { + const selected: string[] = []; + + for (const chunk of tape.chunks) { + if (chunk.when && !chunk.when(parsed)) { + continue; + } + + selected.push(chunk.stdout); + } + + return selected; +} + +/** + * Replays a scripted CLI "tape" across background-shell polls. The suite's + * CommandParser decides which commands are "tracked" (e.g. `novu connect`) and + * how to parse them; the scenario's tape supplies the stdout chunks and + * optional validation. 
+ */ +export class MockShellEngine { + private shells = new Map>(); + private shellCounter = 0; + + constructor( + private readonly scenario: EvalScenario, + private readonly parser: CommandParser + ) {} + + createShell(command: string, runInBackground: boolean, env: Record): MockShellState { + this.shellCounter += 1; + const id = `shell-${this.shellCounter}`; + const isTracked = this.parser.matches(command); + + let parsed: TParsed | null = null; + let parseError: string | null = null; + + if (isTracked) { + try { + parsed = this.parser.parse(command, env); + } catch (error) { + parseError = error instanceof Error ? error.message : String(error); + } + } + + let chunks: string[] = []; + let exitCode: number | null = null; + + if (isTracked && parseError) { + chunks = [`✗ Failed to parse tracked command: ${parseError}`]; + exitCode = 1; + } else if (isTracked && parsed !== null && this.scenario.tape) { + const validationError = this.scenario.tape.validate?.(parsed) ?? null; + + if (validationError) { + chunks = [`✗ ${validationError}`]; + exitCode = 1; + } else { + chunks = selectTapeChunks(this.scenario.tape, parsed); + // A pending branch keeps the shell running (exitCode null) until it is killed, + // so `pollShell` never marks it completed on its own. + exitCode = this.scenario.tape.pendingWhen?.(parsed) ? null : (this.scenario.tape.exitCode ?? 
0); + } + } else if (isTracked && !this.scenario.tape) { + chunks = ['✗ Tracked command was not expected for this scenario.']; + exitCode = 1; + } else if (!runInBackground) { + chunks = [`Executed: ${command}`]; + exitCode = 0; + } else { + chunks = [`Background process started: ${command}`]; + exitCode = null; + } + + const shell: MockShellState = { + id, + command, + parsed, + isTracked, + chunks, + emittedStdout: [], + chunkIndex: 0, + exitCode, + completed: false, + killed: false, + }; + + this.shells.set(id, shell); + + return shell; + } + + pollShell(shellId: string): MockShellState | null { + const shell = this.shells.get(shellId); + + if (!shell || shell.killed) { + return shell ?? null; + } + + if (shell.chunkIndex < shell.chunks.length) { + const nextChunk = shell.chunks[shell.chunkIndex]; + shell.emittedStdout.push(nextChunk); + shell.chunkIndex += 1; + } + + if (shell.chunkIndex >= shell.chunks.length && shell.exitCode !== null) { + shell.completed = true; + } + + return shell; + } + + killShell(shellId: string): boolean { + const shell = this.shells.get(shellId); + + if (!shell) { + return false; + } + + shell.killed = true; + shell.completed = true; + shell.exitCode = shell.exitCode ?? 
143; + + return true; + } + + getShell(shellId: string): MockShellState | undefined { + return this.shells.get(shellId); + } + + listShells(): Array> { + return [...this.shells.values()]; + } +} diff --git a/libs/agent-evals/src/core/recorder.ts b/libs/agent-evals/src/core/recorder.ts new file mode 100644 index 00000000000..28ca869ad0b --- /dev/null +++ b/libs/agent-evals/src/core/recorder.ts @@ -0,0 +1,158 @@ +import type { MockShellState, RunResult, ToolCallRecord } from './types.js'; + +export class RunRecorder { + private toolCalls: ToolCallRecord[] = []; + private assistantMessages: string[] = []; + private finalText = ''; + private capturedUrls: string[] = []; + private openedFiles: string[] = []; + private killedShellIds: string[] = []; + private trackedShellIds: string[] = []; + private polledShellIds: string[] = []; + private trackedCommands: string[] = []; + private metadata: Record = {}; + + constructor( + private readonly scenarioId: string, + private readonly userPrompt: string + ) {} + + recordToolCall(name: string, args: Record, result?: unknown): void { + this.toolCalls.push({ name, args, result, timestamp: Date.now() }); + } + + recordAssistantMessage(text: string): void { + if (text.trim()) { + this.assistantMessages.push(text); + this.finalText = text; + } + } + + recordTrackedCommand(command: string): void { + this.trackedCommands.push(command); + } + + setMetadata(key: string, value: unknown): void { + this.metadata[key] = value; + } + + recordUrl(url: string): void { + if (!this.capturedUrls.includes(url)) { + this.capturedUrls.push(url); + } + } + + recordOpenedFile(filePath: string): void { + this.openedFiles.push(filePath); + } + + recordTrackedShell(shellId: string): void { + this.trackedShellIds.push(shellId); + } + + recordPoll(shellId: string): void { + if (!this.polledShellIds.includes(shellId)) { + this.polledShellIds.push(shellId); + } + } + + recordKill(shellId: string): void { + this.killedShellIds.push(shellId); + } + + build(): 
RunResult { + return { + scenarioId: this.scenarioId, + userPrompt: this.userPrompt, + toolCalls: [...this.toolCalls], + assistantMessages: [...this.assistantMessages], + finalText: this.finalText, + capturedUrls: [...this.capturedUrls], + openedFiles: [...this.openedFiles], + killedShellIds: [...this.killedShellIds], + trackedShellIds: [...this.trackedShellIds], + polledShellIds: [...this.polledShellIds], + trackedCommands: [...this.trackedCommands], + metadata: { ...this.metadata }, + }; + } +} + +export function extractUrls(text: string): string[] { + const matches = text.match(/(?:https?:\/\/|mailto:)[^\s)>\]"']+/g) ?? []; + + return matches.map((url) => url.replace(/[.,;]+$/, '')); +} + +export function isKillCommand(command: string): boolean { + return /^\s*(kill|pkill|killall)\b/.test(command); +} + +export function isOpenCommand(command: string): boolean { + return /^\s*(open|xdg-open|start)\b/.test(command.trim()); +} + +/** + * Drop shell string-literal content (single/double quoted spans and backslash-escaped + * characters) while preserving unquoted command words. A single-pass lexer is required + * because the `'\''` idiom agents use to embed apostrophes — e.g. `'Bob'\''s sleep coach'` — + * splits a value across multiple quote runs that a naive `'...'` regex cannot follow. 
+ */ +function stripShellStringLiterals(command: string): string { + let out = ''; + let i = 0; + + while (i < command.length) { + const ch = command[i]; + + if (ch === "'") { + i += 1; + while (i < command.length && command[i] !== "'") { + i += 1; + } + i += 1; + out += ' '; + } else if (ch === '"') { + i += 1; + while (i < command.length && command[i] !== '"') { + if (command[i] === '\\' && i + 1 < command.length) { + i += 1; + } + i += 1; + } + i += 1; + out += ' '; + } else if (ch === '\\') { + i += 2; + out += ' '; + } else { + out += ch; + i += 1; + } + } + + return out; +} + +export function isForbiddenWatcherCommand(command: string): boolean { + // Scan only unquoted command words so a legitimate agent description such as + // `novu connect "A sleep coaching assistant"` (or `'Bob'\''s sleep coach'`) is not + // rejected for an embedded "sleep"/"tail"/"grep". + const normalized = stripShellStringLiterals(command).toLowerCase(); + + return ( + /\bsleep\b/.test(normalized) || + /\btail\b/.test(normalized) || + /\bgrep\b/.test(normalized) || + /\bps\b/.test(normalized) || + /\bschedulewakeup\b/.test(normalized) + ); +} + +export function shellSummary(shell: MockShellState): string { + if (shell.killed) { + return `Shell ${shell.id} was killed.`; + } + + return shell.emittedStdout.join('\n'); +} diff --git a/libs/agent-evals/src/core/resolve-package-file.ts b/libs/agent-evals/src/core/resolve-package-file.ts new file mode 100644 index 00000000000..d66a51ec525 --- /dev/null +++ b/libs/agent-evals/src/core/resolve-package-file.ts @@ -0,0 +1,7 @@ +import { createRequire } from 'node:module'; + +const require = createRequire(import.meta.url); + +export function resolvePackageFile(specifier: string): string { + return require.resolve(specifier); +} diff --git a/libs/agent-evals/src/core/tools.test.ts b/libs/agent-evals/src/core/tools.test.ts new file mode 100644 index 00000000000..96b12e816d8 --- /dev/null +++ b/libs/agent-evals/src/core/tools.test.ts @@ -0,0 +1,77 
@@ +import fs from 'node:fs'; +import os from 'node:os'; +import path from 'node:path'; +import { afterAll, describe, expect, it } from 'vitest'; +import { RunRecorder } from './recorder.js'; +import { createHarnessContext, createHarnessTools } from './tools.js'; +import type { CommandParser, EvalScenario, Suite } from './types.js'; + +const parser: CommandParser = { matches: () => false, parse: () => ({}) }; + +const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'agent-evals-read-')); +fs.writeFileSync(path.join(tmpDir, 'README.md'), 'hello world'); + +afterAll(() => { + fs.rmSync(tmpDir, { recursive: true, force: true }); +}); + +function makeHarness() { + const scenario: EvalScenario = { + id: 'read-test', + category: 'test', + description: '', + userPrompt: '', + projectRoot: tmpDir, + scriptedAnswers: [], + }; + const suite: Suite = { + id: 'suite', + description: '', + systemPrompt: { text: '' }, + commandParser: parser, + scenarios: [], + }; + const recorder = new RunRecorder('read-test', 'prompt'); + const context = createHarnessContext(suite, scenario, recorder); + const { Read } = createHarnessTools(context); + const read = Read as unknown as { + execute: (args: { file_path: string }) => Promise<{ content?: string; error?: string }>; + }; + + return { read, recorder }; +} + +function readCalls(recorder: RunRecorder) { + return recorder.build().toolCalls.filter((call) => call.name === 'Read'); +} + +describe('Read tool records exactly once per call', () => { + it('records a single Read for a successful read (with byte count)', async () => { + const { read, recorder } = makeHarness(); + + const result = await read.execute({ file_path: 'README.md' }); + + expect(result.content).toBe('hello world'); + + const calls = readCalls(recorder); + expect(calls).toHaveLength(1); + expect(calls[0].result).toMatchObject({ bytes: 'hello world'.length }); + }); + + it('records a single Read for a PNG placeholder', async () => { + const { read, recorder } = makeHarness(); 
+ + await read.execute({ file_path: 'qr.png' }); + + expect(readCalls(recorder)).toHaveLength(1); + }); + + it('records a single Read for a failed read', async () => { + const { read, recorder } = makeHarness(); + + const result = await read.execute({ file_path: 'does-not-exist.txt' }); + + expect(result.error).toBeDefined(); + expect(readCalls(recorder)).toHaveLength(1); + }); +}); diff --git a/libs/agent-evals/src/core/tools.ts b/libs/agent-evals/src/core/tools.ts new file mode 100644 index 00000000000..60ce83b0a34 --- /dev/null +++ b/libs/agent-evals/src/core/tools.ts @@ -0,0 +1,373 @@ +import fs from 'node:fs/promises'; +import path from 'node:path'; +import { tool } from 'ai'; +import { z } from 'zod'; +import { MockShellEngine } from './mock-shell.js'; +import { + extractUrls, + isForbiddenWatcherCommand, + isKillCommand, + isOpenCommand, + RunRecorder, + shellSummary, +} from './recorder.js'; +import type { EvalScenario, ParsedCommand, ScriptedAnswer, Suite } from './types.js'; +import { normalizePath } from './types.js'; + +export type HarnessContext = { + suite: Suite; + scenario: EvalScenario; + recorder: RunRecorder; + engine: MockShellEngine; + answerIndex: number; + lastBackgroundShellId?: string; + env: Record; +}; + +function pickScriptedAnswer( + scenario: EvalScenario, + question: string, + answerIndex: number +): ScriptedAnswer | undefined { + const remaining = scenario.scriptedAnswers.slice(answerIndex); + + for (const answer of remaining) { + if (answer.match?.test(question)) { + return answer; + } + + if (answer.questionContains && question.toLowerCase().includes(answer.questionContains.toLowerCase())) { + return answer; + } + } + + return remaining[0]; +} + +async function readFixtureFile(projectRoot: string, filePath: string): Promise { + const normalized = normalizePath(filePath); + const resolvedRoot = path.resolve(projectRoot); + const absolutePath = path.isAbsolute(normalized) + ? 
path.normalize(normalized) + : path.resolve(resolvedRoot, normalized); + + // Segment-safe containment: `path.relative` yields a `..`-prefixed (or absolute) + // result when the target escapes the root, so sibling roots like `-evil` + // no longer pass a naive prefix check. + const relative = path.relative(resolvedRoot, absolutePath); + + if (relative === '' || relative.startsWith('..') || path.isAbsolute(relative)) { + throw new Error(`Refusing to read path outside fixture project: ${filePath}`); + } + + return fs.readFile(absolutePath, 'utf8'); +} + +/** + * Read a single shell value, honoring single quotes, double quotes, and backslash + * escapes (including the `'\''` idiom agents use to embed apostrophes). Reading stops + * at the first unquoted whitespace. Returns the decoded value and how many characters + * were consumed so the caller can find the residual command. + */ +function readShellValue(input: string): { value: string; consumed: number } { + let out = ''; + let i = 0; + + while (i < input.length) { + const ch = input[i]; + + if (ch === "'") { + i += 1; + while (i < input.length && input[i] !== "'") { + out += input[i]; + i += 1; + } + i += 1; + } else if (ch === '"') { + i += 1; + while (i < input.length && input[i] !== '"') { + if (input[i] === '\\' && i + 1 < input.length) { + i += 1; + } + out += input[i]; + i += 1; + } + i += 1; + } else if (ch === '\\') { + if (i + 1 < input.length) { + out += input[i + 1]; + i += 2; + } else { + i += 1; + } + } else if (/\s/.test(ch) || ch === ';' || ch === '&') { + // Unquoted shell separators end the value so a one-line + // `export X=foo;npx novu connect …` leaves the connect command as the residual. + break; + } else { + out += ch; + i += 1; + } + } + + return { value: out, consumed: i }; +} + +/** + * Capture any leading `export VAR=` assignments into the harness env, then return + * the residual command (e.g. the `npx novu connect …` that follows). 
Agents commonly run + * the playbook's Step 3 block — an `export` plus the connect command — in a single shell + * call (joined by a newline, `;`, or `&&`); the residual must still execute so the connect + * command is tracked and streamed. Returns the original command unchanged when it does not + * start with an export. + */ +function captureLeadingExports(command: string, env: Record): string { + let rest = command; + let capturedAny = false; + + for (;;) { + const stripped = rest.replace(/^[\s;&]+/, ''); + const match = stripped.match(/^export\s+([A-Z_][A-Z0-9_]*)=/); + + if (!match?.[1]) { + break; + } + + capturedAny = true; + const afterEq = stripped.slice(match[0].length); + const { value, consumed } = readShellValue(afterEq); + env[match[1]] = value; + rest = afterEq.slice(consumed); + } + + return capturedAny ? rest.replace(/^[\s;&]+/, '') : command; +} + +export function createHarnessTools(context: HarnessContext) { + const Bash = tool({ + description: + 'Executes a bash command. Use run_in_background: true for long-running commands, then poll with BashOutput.', + inputSchema: z.object({ + command: z.string().describe('The bash command to execute.'), + run_in_background: z.boolean().optional().describe('Run the command in the background.'), + description: z.string().optional().describe('Short description of what the command does.'), + }), + execute: async ({ command: rawCommand, run_in_background: runInBackground }) => { + context.recorder.recordToolCall('Bash', { command: rawCommand, run_in_background: runInBackground }); + + if (isForbiddenWatcherCommand(rawCommand)) { + return { + error: 'Command rejected by harness.', + stdout: '', + stderr: 'Do not use sleep/tail/grep watchers. Poll BashOutput on the background shell instead.', + exitCode: 1, + }; + } + + // Capture leading `export VAR=…` assignments, then continue with whatever follows + // (e.g. the connect command in the same block). A pure export block has no residual. 
+ const command = captureLeadingExports(rawCommand, context.env); + + if (!command) { + return { stdout: '', stderr: '', exitCode: 0 }; + } + + if (isOpenCommand(command)) { + const fileMatch = command.match(/["']([^"']+\.png)["']/i) ?? command.match(/\s(\S+\.png)\s*$/i); + + if (fileMatch?.[1]) { + context.recorder.recordOpenedFile(fileMatch[1]); + } + + return { stdout: 'Opened image viewer.', stderr: '', exitCode: 0 }; + } + + if (isKillCommand(command)) { + const shellId = context.lastBackgroundShellId; + + if (shellId) { + context.engine.killShell(shellId); + context.recorder.recordKill(shellId); + } + + return { stdout: shellId ? `Killed shell ${shellId}` : 'No shell to kill.', stderr: '', exitCode: 0 }; + } + + const shell = context.engine.createShell(command, Boolean(runInBackground), context.env); + + if (shell.isTracked) { + context.recorder.recordTrackedCommand(command); + context.recorder.recordTrackedShell(shell.id); + context.lastBackgroundShellId = shell.id; + + if (shell.parsed && context.suite.onTrackedCommand) { + context.suite.onTrackedCommand(command, shell.parsed, context.recorder); + } + } + + if (runInBackground) { + context.engine.pollShell(shell.id); + const backgroundStdout = shell.emittedStdout.join('\n'); + + for (const url of extractUrls(backgroundStdout)) { + context.recorder.recordUrl(url); + } + + return { + shellId: shell.id, + stdout: backgroundStdout, + stderr: '', + running: !shell.completed, + }; + } + + context.engine.pollShell(shell.id); + + while (!shell.completed && shell.chunkIndex < shell.chunks.length) { + context.engine.pollShell(shell.id); + } + + const stdout = shell.emittedStdout.join('\n'); + + for (const url of extractUrls(stdout)) { + context.recorder.recordUrl(url); + } + + return { stdout, stderr: '', exitCode: shell.exitCode ?? 
0 }; + }, + }); + + const BashOutput = tool({ + description: 'Poll stdout/stderr from a background shell started with Bash run_in_background: true.', + inputSchema: z.object({ + shellId: z.string().describe('Background shell id returned by Bash.'), + }), + execute: async ({ shellId }) => { + context.recorder.recordToolCall('BashOutput', { shellId }); + + const shell = context.engine.pollShell(shellId); + + if (!shell) { + return { error: `Unknown shell id: ${shellId}`, stdout: '', completed: true, exitCode: 1 }; + } + + context.recorder.recordPoll(shellId); + + const stdout = shellSummary(shell); + + for (const url of extractUrls(stdout)) { + context.recorder.recordUrl(url); + } + + for (const pattern of context.suite.sentinelFilePatterns ?? []) { + const match = stdout.match(pattern); + + if (match?.[1]) { + try { + // Route through the fixture-root guard: the path is captured from + // agent-controlled shell output, so an injected absolute path must not + // escape the scenario workspace. + const fileContents = await readFixtureFile(context.scenario.projectRoot, match[1]); + + for (const url of extractUrls(fileContents)) { + context.recorder.recordUrl(url); + } + } catch { + // Sentinel file may not exist (or sits outside the fixture root); ignore. + } + } + } + + return { + shellId, + stdout, + completed: shell.completed, + exitCode: shell.exitCode, + killed: shell.killed, + }; + }, + }); + + const AskUserQuestion = tool({ + description: 'Ask the user a structured question with 2-4 options.', + inputSchema: z.object({ + question: z.string(), + options: z + .array( + z.object({ + id: z.string(), + label: z.string(), + description: z.string().optional(), + }) + ) + .min(2) + .max(4), + }), + execute: async ({ question, options }) => { + const scripted = pickScriptedAnswer(context.scenario, question, context.answerIndex); + context.answerIndex += 1; + + const selected = + options.find((option) => option.id === scripted?.optionId) ?? 
+ options.find((option) => option.label === scripted?.label) ?? + options[0]; + + context.recorder.recordToolCall('AskUserQuestion', { question, options }, { selectedId: selected.id }); + + return { selectedId: selected.id, selectedLabel: selected.label }; + }, + }); + + const Read = tool({ + description: 'Read a file from the project workspace.', + inputSchema: z.object({ + file_path: z.string(), + }), + execute: async ({ file_path: filePath }) => { + // Record exactly once per call, inside each branch, so a successful read is not + // logged twice (which would double every `toolCallsNamed(result, 'Read')` count + // and corrupt the tool-call timeline). + if (filePath.includes('/tmp/') || filePath.endsWith('.log')) { + context.recorder.recordToolCall('Read', { file_path: filePath }); + + return { error: 'Reading log files is discouraged in this flow.' }; + } + + if (filePath.endsWith('.png')) { + context.recorder.recordToolCall('Read', { file_path: filePath }); + + return { content: '[PNG image omitted by harness]' }; + } + + try { + const content = await readFixtureFile(context.scenario.projectRoot, filePath); + context.recorder.recordToolCall('Read', { file_path: filePath }, { bytes: content.length }); + + return { content }; + } catch (error) { + context.recorder.recordToolCall('Read', { file_path: filePath }); + + return { error: error instanceof Error ? error.message : 'Failed to read file.' 
}; + } + }, + }); + + return { Bash, BashOutput, AskUserQuestion, Read }; +} + +export function createHarnessContext( + suite: Suite, + scenario: EvalScenario, + recorder: RunRecorder +): HarnessContext { + return { + suite, + scenario, + recorder, + engine: new MockShellEngine(scenario, suite.commandParser), + answerIndex: 0, + env: {}, + }; +} + +export type HarnessTools = ReturnType; diff --git a/libs/agent-evals/src/core/types.ts b/libs/agent-evals/src/core/types.ts new file mode 100644 index 00000000000..5f681c28f42 --- /dev/null +++ b/libs/agent-evals/src/core/types.ts @@ -0,0 +1,142 @@ +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; + +export type GraderResult = 'pass' | 'fail' | 'skip'; + +/** A grader can return a bare status, or a status with a human-readable reason (used for fails). */ +export type GraderOutcome = { + status: GraderResult; + reason?: string; +}; + +export type GraderFn = (result: RunResult) => GraderResult | GraderOutcome | Promise; + +export type GraderDefinition = { + kind: 'deterministic' | 'judge'; + run: GraderFn; + /** Human-readable label shown in eval reports (defaults to the grader key). */ + label?: string; +}; + +export type ToolCallRecord = { + name: string; + args: Record; + result?: unknown; + timestamp: number; +}; + +/** A command parsed by a suite's CommandParser. Suites narrow this to a concrete shape. */ +export type ParsedCommand = Record; + +export type TapeChunk = { + stdout: string; + when?: (parsed: TParsed) => boolean; +}; + +export type Tape = { + chunks: Array>; + exitCode?: number; + /** Optional suite-defined validation; return an error string to make the tracked command fail. */ + validate?: (parsed: TParsed) => string | null; + /** + * When this returns true for a parsed command, the shell stays running (no exit code) + * after emitting its chunks and only completes when the agent kills it. Models real + * long-running CLI branches (e.g. 
the no-token Slack connect that waits for a config + * token) so a "kill before re-run" requirement is genuinely enforceable. + */ + pendingWhen?: (parsed: TParsed) => boolean; +}; + +export type ScriptedAnswer = { + match?: RegExp; + questionContains?: string; + optionId: string; + label?: string; +}; + +export type EvalScenario = { + id: string; + category: string; + description: string; + userPrompt: string; + projectRoot: string; + scriptedAnswers: ScriptedAnswer[]; + tape?: Tape; + followUpMessages?: string[]; + /** When set, a follow-up is injected if the agent selects this option id in a picker. */ + followUpOnOptionId?: string; + /** Scenario-specific configuration consumed by suite graders. */ + metadata?: Record; +}; + +export type RunResult = { + scenarioId: string; + userPrompt: string; + toolCalls: ToolCallRecord[]; + assistantMessages: string[]; + finalText: string; + capturedUrls: string[]; + openedFiles: string[]; + killedShellIds: string[]; + /** Shell ids of commands the suite parser marked as tracked (e.g. the connect command). */ + trackedShellIds: string[]; + polledShellIds: string[]; + /** Raw command strings the suite parser marked as tracked. */ + trackedCommands: string[]; + /** Suite-owned captures (e.g. the drafted agent description). */ + metadata: Record; +}; + +export type MockShellState = { + id: string; + command: string; + parsed: TParsed | null; + isTracked: boolean; + chunks: string[]; + emittedStdout: string[]; + chunkIndex: number; + exitCode: number | null; + completed: boolean; + killed: boolean; +}; + +/** Parses shell commands a suite cares about (e.g. `novu connect`). */ +export type CommandParser = { + matches: (command: string) => boolean; + parse: (command: string, env: Record) => TParsed; +}; + +export type RegisteredScenario = { + scenario: EvalScenario; + graders: Record; +}; + +/** A suite plugs suite-specific behavior into the generic harness. 
*/ +export type Suite = { + id: string; + description: string; + /** Playbook/instructions injected as the system prompt. */ + systemPrompt: { path: string } | { text: string }; + /** Optional override for the agent preamble prepended to the playbook. */ + systemPromptPreamble?: string; + commandParser: CommandParser; + scenarios: Array>; + /** stdout patterns whose captured path (group 1) holds a URL to read and record. */ + sentinelFilePatterns?: RegExp[]; + /** Text pattern in assistant output that should trigger a scripted follow-up message. */ + followUpTextPattern?: RegExp; + /** Hook to capture suite-specific metadata when a tracked command runs. */ + onTrackedCommand?: ( + command: string, + parsed: TParsed, + recorder: { setMetadata: (k: string, v: unknown) => void } + ) => void; +}; + +const currentDir = path.dirname(fileURLToPath(import.meta.url)); + +export const PACKAGE_ROOT = path.resolve(currentDir, '../..'); + +export function normalizePath(input: string): string { + return input.replace(/\\/g, '/').replace(/^\.\/+/, ''); +} diff --git a/libs/agent-evals/src/load-env.ts b/libs/agent-evals/src/load-env.ts new file mode 100644 index 00000000000..cf76c506cf6 --- /dev/null +++ b/libs/agent-evals/src/load-env.ts @@ -0,0 +1,7 @@ +import { dirname, resolve } from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { config } from 'dotenv'; + +const packageRoot = resolve(dirname(fileURLToPath(import.meta.url)), '..'); + +config({ path: resolve(packageRoot, '.env') }); diff --git a/libs/agent-evals/src/suites/agent-onboarding/adapters.ts b/libs/agent-evals/src/suites/agent-onboarding/adapters.ts new file mode 100644 index 00000000000..08b282f2bcb --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/adapters.ts @@ -0,0 +1,29 @@ +import { createJudge, type Judge } from 'vitest-evals'; +import type { GraderDefinition, GraderOutcome, GraderResult, RunResult } from '../../core/types.js'; + +function toOutcome(value: GraderResult | 
GraderOutcome): GraderOutcome { + return typeof value === 'string' ? { status: value } : value; +} + +function outcomeToScore(outcome: GraderOutcome): number { + if (outcome.status === 'skip') { + return 1; + } + + return outcome.status === 'pass' ? 1 : 0; +} + +export function graderToJudge(name: string, definition: GraderDefinition): Judge { + return createJudge(definition.label ?? name, async ({ output }) => { + const outcome = toOutcome(await definition.run(output as RunResult)); + + return { + score: outcomeToScore(outcome), + metadata: outcome.reason ? { rationale: outcome.reason, status: outcome.status } : { status: outcome.status }, + }; + }); +} + +export function gradersToJudges(graders: Record): Judge[] { + return Object.entries(graders).map(([name, definition]) => graderToJudge(name, definition)); +} diff --git a/libs/agent-evals/src/suites/agent-onboarding/catalog.ts b/libs/agent-evals/src/suites/agent-onboarding/catalog.ts new file mode 100644 index 00000000000..2275073bffc --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/catalog.ts @@ -0,0 +1,212 @@ +import { defineGraders, fail, judge, labeled, toolCallsNamed, transcriptText } from '../../core/graders.js'; +import { isForbiddenWatcherCommand } from '../../core/recorder.js'; +import type { GraderOutcome, RunResult } from '../../core/types.js'; + +/** The drafted agent description is captured into metadata by the suite's onTrackedCommand hook. */ +export function descriptionText(result: RunResult): string { + return typeof result.metadata.description === 'string' ? result.metadata.description : ''; +} + +export function connectCommands(result: RunResult): string[] { + return result.trackedCommands; +} + +function firstConnectCall(result: RunResult) { + return result.toolCalls.find((call) => call.name === 'Bash' && /\bconnect\b/.test(String(call.args.command ?? 
''))); +} + +export const judgePrompts = { + personaAudienceFit: + 'Does the drafted agent description frame the agent for the product end-user audience in domain language, without drifting into a developer/coding-assistant persona?', + noInfraMcpSemantic: + 'Does the drafted agent description avoid naming internal infrastructure or backend plumbing (databases, email delivery APIs, queues, caches, dev tooling) even via synonyms?', + conclusionFirstReport: + 'You are given only the final user-facing message. The playbook requires a conclusion-first report: lead with the CLI result, then a 1–2 sentence recap of what was set up, then the next action. Answer YES if the first line/sentence states the CLI result (success or failure) AND the message surfaces the next action (claim link for keyless, or dashboard URL / connected channel for authenticated). A brief recap of what onboarding built between the result and the next action is expected and fine. Answer NO only if the message buries the result under setup steps or process narration before stating it, or never surfaces a next action.', +}; + +export const catalog = { + noSecretKeyFlag: (result: RunResult): GraderOutcome | 'pass' => + connectCommands(result).every((cmd) => !/--secret-key\b/.test(cmd) && !/\bNOVU_SECRET_KEY=/.test(cmd)) + ? 'pass' + : fail('a connect command passed --secret-key or NOVU_SECRET_KEY'), + + usedDashboardOAuthWhenPrompted: (result: RunResult): GraderOutcome | 'pass' => { + if (!/signed in to the Novu dashboard/i.test(result.userPrompt)) { + return 'pass'; + } + + const commands = connectCommands(result); + + if (commands.length === 0) { + return fail('user was signed into the dashboard but connect was never run'); + } + + return commands.every((cmd) => !/--keyless\b/.test(cmd)) + ? 
'pass' + : fail('user was signed into the dashboard but a connect command used --keyless instead of dashboard OAuth'); + }, + + backgroundConnectShell: (result: RunResult): GraderOutcome | 'pass' => { + const connectCall = firstConnectCall(result); + + if (!connectCall) { + return fail('no connect command was run'); + } + + if (!connectCall.args.run_in_background) { + return fail('connect command was not run in the background (run_in_background was not set)'); + } + + return result.polledShellIds.length > 0 + ? 'pass' + : fail('background connect shell was never polled with BashOutput'); + }, + + noTimersNoWatchers: (result: RunResult): GraderOutcome | 'pass' => { + const forbiddenCall = result.toolCalls.find((call) => { + if (call.name !== 'Bash') { + return false; + } + + const command = String(call.args.command ?? ''); + + return isForbiddenWatcherCommand(command); + }); + + if (forbiddenCall) { + return fail(`used a timer/watcher command: ${String(forbiddenCall.args.command ?? '')}`); + } + + const readLogCall = result.toolCalls.find((call) => { + if (call.name !== 'Read') { + return false; + } + + const filePath = String(call.args.file_path ?? ''); + + return filePath.includes('/tmp/') || filePath.endsWith('.log'); + }); + + return readLogCall + ? fail(`tailed a log file instead of polling: ${String(readLogCall.args.file_path ?? '')}`) + : 'pass'; + }, + + usedPickerForDecisions: (result: RunResult): GraderOutcome | 'pass' => + toolCallsNamed(result, 'AskUserQuestion').length >= 1 + ? 'pass' + : fail('no AskUserQuestion picker was used for decisions'), + + pastedLiteralUrl: + (expectedUrl: string) => + (result: RunResult): GraderOutcome | 'pass' => + result.capturedUrls.includes(expectedUrl) || transcriptText(result).includes(expectedUrl) + ? 
'pass' + : fail(`expected URL not surfaced to the user: ${expectedUrl}`), + + descriptionExcludesInfraTokens: + (tokens: string[]) => + (result: RunResult): GraderOutcome | 'pass' => { + const description = descriptionText(result).toLowerCase(); + const offending = tokens.filter((token) => description.includes(token.toLowerCase())); + + return offending.length > 0 ? fail(`description mentions infra tokens: ${offending.join(', ')}`) : 'pass'; + }, + + descriptionIncludesTokens: + (tokens: string[]) => + (result: RunResult): GraderOutcome | 'pass' => { + const description = descriptionText(result).toLowerCase(); + + return tokens.some((token) => description.includes(token.toLowerCase())) + ? 'pass' + : fail(`description is missing all expected tokens: ${tokens.join(', ')}`); + }, + + noConnectOnKeylessWhatsapp: (result: RunResult): GraderOutcome | 'pass' => { + if (connectCommands(result).length > 0) { + return fail('ran a connect command on a keyless WhatsApp flow that should redirect to the dashboard'); + } + + const text = transcriptText(result); + const mentionsDashboard = /dashboard\.novu\.co|\bdashboard\b/i.test(text); + const directsThere = /dashboard\.novu\.co|redirect|continue|sign[\s-]?(in|up)|head (over )?to|go to|open/i.test( + text + ); + + return mentionsDashboard && directsThere ? 'pass' : fail('did not direct the user to the dashboard'); + }, + + confirmedBeforeRun: (result: RunResult): GraderOutcome | 'pass' => { + const approveIndex = result.toolCalls.findIndex( + (call) => + call.name === 'AskUserQuestion' && + (call.result as { selectedId?: string } | undefined)?.selectedId === 'approve' + ); + const firstConnectIndex = result.toolCalls.findIndex( + (call) => call.name === 'Bash' && /\bconnect\b/.test(String(call.args.command ?? '')) + ); + + if (firstConnectIndex === -1) { + return 'pass'; + } + + return approveIndex !== -1 && approveIndex < firstConnectIndex + ? 
'pass' + : fail('ran connect without an approved confirmation picker beforehand'); + }, + + qrHostAware: (result: RunResult): GraderOutcome | 'pass' => { + const openedPng = result.openedFiles.some((file) => file.endsWith('.png')); + // The playbook's host-aware delivery also allows chat UIs to embed the PNG as an + // inline Markdown image (`![…]()`) instead of an OS `open`. + const embeddedPng = /!\[[^\]]*]\([^)]*\.png[^)]*\)/i.test(transcriptText(result)); + + return openedPng || embeddedPng ? 'pass' : fail('did not open or embed the QR code image'); + }, + + reranWithSlackToken: (result: RunResult): GraderOutcome | 'pass' => + connectCommands(result).some((cmd) => /--slack-config-token\b/.test(cmd)) + ? 'pass' + : fail('did not re-run connect with --slack-config-token'), + + killedFirstConnectShell: (result: RunResult): GraderOutcome | 'pass' => + result.killedShellIds.length >= 1 ? 'pass' : fail('the first connect shell was never killed'), + + readAuthUrlFile: (result: RunResult): GraderOutcome | 'pass' => + result.toolCalls.some( + (call) => call.name === 'Read' && String(call.args.file_path ?? '').includes('novu-connect-auth-url') + ) || + result.capturedUrls.some((url) => url.includes('/oauth/device')) || + transcriptText(result).includes('/oauth/device') + ? 'pass' + : fail('never read the auth-url file or surfaced the /oauth/device URL'), + + reportedSuccess: (result: RunResult): GraderOutcome | 'pass' => + /agent is (now )?live|✓ your agent/i.test(transcriptText(result)) + ? 'pass' + : fail('final report did not confirm the agent is live'), + + noConnectCommands: (result: RunResult): GraderOutcome | 'pass' => + connectCommands(result).length === 0 ? 'pass' : fail('ran a connect command when none was expected'), + + usedSecureTokenPath: (result: RunResult): GraderOutcome | 'pass' => + connectCommands(result).every((cmd) => !/--slack-config-token\b/.test(cmd)) + ? 
'pass' + : fail('passed --slack-config-token inline instead of the secure token path'), +}; + +export const sharedJudgeGraders = defineGraders({ + personaAudienceFit: labeled( + 'frames the agent for the product end-user audience in domain language', + judge(judgePrompts.personaAudienceFit, (result) => [descriptionText(result), transcriptText(result)].join('\n')) + ), + noInfraMcpSemantic: labeled( + 'avoids naming internal infrastructure in the drafted agent description', + judge(judgePrompts.noInfraMcpSemantic, (result) => descriptionText(result)) + ), + conclusionFirstReport: labeled( + 'leads the final report with the CLI result and next action', + judge(judgePrompts.conclusionFirstReport, (result) => result.finalText) + ), +}); diff --git a/libs/agent-evals/src/suites/agent-onboarding/connect-parser.test.ts b/libs/agent-evals/src/suites/agent-onboarding/connect-parser.test.ts new file mode 100644 index 00000000000..fd950b67d69 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/connect-parser.test.ts @@ -0,0 +1,82 @@ +import { describe, expect, it } from 'vitest'; +import { type ConnectFlags, connectParser, connectValidate } from './connect-parser.js'; +import { buildDefaultTape } from './tape.js'; + +const baseFlags: ConnectFlags = { keyless: true, secretKey: false, ci: true, channel: 'slack' }; + +describe('connectParser', () => { + it('strips quotes from --channel values', () => { + const flags = connectParser.parse('npx novu@latest connect "Wine concierge" --ci --keyless --channel "slack"', {}); + + expect(flags.channel).toBe('slack'); + }); + + it('parses a positional description that follows flags', () => { + const flags = connectParser.parse( + 'npx novu@latest connect --ci --keyless --channel slack "Wine staff concierge"', + {} + ); + + expect(flags.description).toBe('Wine staff concierge'); + expect(flags.channel).toBe('slack'); + }); + + it('parses a positional description that precedes flags', () => { + const flags = 
connectParser.parse('npx novu connect "Wine concierge" --ci --channel slack', {}); + + expect(flags.description).toBe('Wine concierge'); + }); + + it('handles the embedded-apostrophe idiom in a positional description', () => { + const flags = connectParser.parse(`npx novu connect 'Bob'\\''s wine helper' --ci --channel slack`, {}); + + expect(flags.description).toBe("Bob's wine helper"); + }); + + it('resolves a $NOVU_AGENT_DESCRIPTION positional from env', () => { + const flags = connectParser.parse('npx novu connect "$NOVU_AGENT_DESCRIPTION" --ci --keyless --channel slack', { + NOVU_AGENT_DESCRIPTION: 'Wine staff concierge', + }); + + expect(flags.description).toBe('Wine staff concierge'); + }); + + it('reads --slack-config-token without surrounding quotes', () => { + const flags = connectParser.parse('npx novu connect --ci --channel slack --slack-config-token "xoxe.test"', {}); + + expect(flags.slackConfigToken).toBe('xoxe.test'); + }); +}); + +describe('connectValidate', () => { + it('requires a channel when allowedChannels is set', () => { + const error = connectValidate({ allowedChannels: ['slack'] })({ ...baseFlags, channel: undefined }); + + expect(error).toMatch(/Expected --channel/); + }); + + it('rejects a channel outside the allow list', () => { + const error = connectValidate({ allowedChannels: ['slack'] })({ ...baseFlags, channel: 'email' }); + + expect(error).toMatch(/Unexpected channel/); + }); + + it('passes a valid keyless command', () => { + expect(connectValidate({ allowedChannels: ['slack'], requireKeyless: true })(baseFlags)).toBeNull(); + }); +}); + +describe('buildDefaultTape', () => { + it('requires --keyless by default', () => { + const tape = buildDefaultTape({ allowedChannels: ['slack'] }); + + expect(tape.validate?.({ ...baseFlags, keyless: false })).toMatch(/--keyless/); + expect(tape.validate?.({ ...baseFlags, keyless: true })).toBeNull(); + }); + + it('does not require --keyless when requireNoKeyless is set', () => { + const tape = 
buildDefaultTape({ allowedChannels: ['slack'], requireNoKeyless: true }); + + expect(tape.validate?.({ ...baseFlags, keyless: false })).toBeNull(); + }); +}); diff --git a/libs/agent-evals/src/suites/agent-onboarding/connect-parser.ts b/libs/agent-evals/src/suites/agent-onboarding/connect-parser.ts new file mode 100644 index 00000000000..4517c308d7e --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/connect-parser.ts @@ -0,0 +1,211 @@ +import type { CommandParser } from '../../core/types.js'; + +export type ConnectFlags = { + keyless: boolean; + secretKey: boolean; + ci: boolean; + channel?: string; + description?: string; + slackConfigToken?: string; +}; + +export function isConnectCommand(command: string): boolean { + return /\bnovu(@[\w.-]+)?\s+connect\b/.test(command) || /\bnpx\s+[^\s]*novu[^\s]*\s+connect\b/.test(command); +} + +/** Flags that consume the following token as their value (so it is not a positional). */ +const VALUE_FLAGS = new Set(['--channel', '--slack-config-token', '--secret-key', '--api-url', '--dashboard-url']); + +/** + * Split a command into shell words, honoring single quotes, double quotes, and backslash + * escapes (including the `'\''` idiom agents use to embed apostrophes). Quotes are stripped + * from the decoded words, so `--channel "slack"` yields `['--channel', 'slack']` rather than + * leaving the quotes attached to the value. 
+ */ +function tokenizeShellWords(input: string): string[] { + const words: string[] = []; + let i = 0; + + while (i < input.length) { + while (i < input.length && /\s/.test(input[i])) { + i += 1; + } + + if (i >= input.length) { + break; + } + + let word = ''; + + while (i < input.length && !/\s/.test(input[i])) { + const ch = input[i]; + + if (ch === "'") { + i += 1; + while (i < input.length && input[i] !== "'") { + word += input[i]; + i += 1; + } + i += 1; + } else if (ch === '"') { + i += 1; + while (i < input.length && input[i] !== '"') { + if (input[i] === '\\' && i + 1 < input.length) { + i += 1; + } + word += input[i]; + i += 1; + } + i += 1; + } else if (ch === '\\') { + if (i + 1 < input.length) { + word += input[i + 1]; + i += 2; + } else { + i += 1; + } + } else { + word += ch; + i += 1; + } + } + + words.push(word); + } + + return words; +} + +/** Read a flag's value, supporting both `--flag value` and `--flag=value` forms. */ +function readFlagValue(tokens: string[], flag: string): string | undefined { + for (let i = 0; i < tokens.length; i += 1) { + const token = tokens[i]; + + if (token === flag) { + return tokens[i + 1]; + } + + if (token.startsWith(`${flag}=`)) { + return token.slice(flag.length + 1); + } + } + + return undefined; +} + +/** + * Find the first positional argument after `connect` — i.e. the first token that is not a + * flag and is not consumed as a value-flag's value. This matches the playbook command no + * matter where the quoted description sits (e.g. `connect "Desc" --ci` or + * `connect --ci --channel slack "Desc"`). 
+ */ +function findConnectPositional(tokens: string[]): string | undefined { + const connectIndex = tokens.indexOf('connect'); + + if (connectIndex === -1) { + return undefined; + } + + let skipNext = false; + + for (let i = connectIndex + 1; i < tokens.length; i += 1) { + const token = tokens[i]; + + if (skipNext) { + skipNext = false; + continue; + } + + if (token.startsWith('-')) { + if (VALUE_FLAGS.has(token)) { + skipNext = true; + } + + continue; + } + + return token; + } + + return undefined; +} + +function resolveDescription(command: string, tokens: string[], env: Record): string | undefined { + const exportMatch = command.match(/export\s+NOVU_AGENT_DESCRIPTION=(.+)/); + + if (exportMatch?.[1]) { + const [value] = tokenizeShellWords(exportMatch[1].trimStart()); + + if (value && !value.includes('$')) { + return value; + } + } + + const positional = findConnectPositional(tokens); + + // A positional that references the env var (e.g. "$NOVU_AGENT_DESCRIPTION") resolves from env. + if (positional && !positional.includes('$')) { + return positional; + } + + return env.NOVU_AGENT_DESCRIPTION; +} + +export const connectParser: CommandParser = { + matches: isConnectCommand, + parse(command, env) { + const tokens = tokenizeShellWords(command); + + const flags: ConnectFlags = { + keyless: /--keyless\b/.test(command), + secretKey: /--secret-key\b/.test(command) || /\bNOVU_SECRET_KEY=/.test(command), + ci: /--ci\b/.test(command), + }; + + flags.channel = readFlagValue(tokens, '--channel'); + flags.slackConfigToken = readFlagValue(tokens, '--slack-config-token'); + flags.description = resolveDescription(command, tokens, env); + + return flags; + }, +}; + +export type ConnectValidationOptions = { + /** Keyless flow: the connect command must pass `--keyless` (the default for this flow). */ + requireKeyless?: boolean; + /** Dashboard OAuth flow: the connect command must omit `--keyless` (the CLI default path). 
*/ + requireNoKeyless?: boolean; + allowedChannels?: string[]; +}; + +export function connectValidate(options: ConnectValidationOptions): (flags: ConnectFlags) => string | null { + return (flags) => { + if (options.requireKeyless && !flags.keyless) { + return 'Expected --keyless flag for this scenario.'; + } + + if (options.requireNoKeyless && flags.keyless) { + return 'Did not expect --keyless flag for this scenario (use dashboard OAuth by omitting it).'; + } + + if (flags.secretKey) { + return 'Must not pass --secret-key in guided onboarding flow.'; + } + + if (options.allowedChannels?.length) { + if (!flags.channel) { + return `Expected --channel flag (one of: ${options.allowedChannels.join(', ')}).`; + } + + if (!options.allowedChannels.includes(flags.channel)) { + return `Unexpected channel "${flags.channel}". Expected one of: ${options.allowedChannels.join(', ')}.`; + } + } + + if (!flags.ci) { + return 'Expected --ci flag.'; + } + + return null; + }; +} diff --git a/libs/agent-evals/src/suites/agent-onboarding/graders.test.ts b/libs/agent-evals/src/suites/agent-onboarding/graders.test.ts new file mode 100644 index 00000000000..589a951b155 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/graders.test.ts @@ -0,0 +1,68 @@ +import { describe, expect, it } from 'vitest'; +import type { RunResult } from '../../core/types.js'; +import { graderToJudge } from './adapters.js'; +import { graders as keylessWhatsappGraders } from './scenarios/keyless-whatsapp-redirect/graders.js'; + +function buildResult(partial: Partial): RunResult { + return { + scenarioId: partial.scenarioId ?? 'test', + userPrompt: partial.userPrompt ?? 'Connect WhatsApp', + toolCalls: partial.toolCalls ?? [], + assistantMessages: partial.assistantMessages ?? [], + finalText: partial.finalText ?? '', + capturedUrls: partial.capturedUrls ?? [], + openedFiles: partial.openedFiles ?? [], + killedShellIds: partial.killedShellIds ?? [], + trackedShellIds: partial.trackedShellIds ?? 
[], + polledShellIds: partial.polledShellIds ?? [], + trackedCommands: partial.trackedCommands ?? [], + metadata: partial.metadata ?? {}, + }; +} + +async function averageScore( + graders: Record unknown }>, + result: RunResult +): Promise { + const judges = Object.entries(graders).map(([name, definition]) => graderToJudge(name, definition)); + const scores = await Promise.all( + judges.map(async (judge) => { + const verdict = await judge.assess({ output: result } as never); + + return verdict.score; + }) + ); + + if (scores.length === 0) { + return 0; + } + + return scores.reduce((sum, score) => sum + score, 0) / scores.length; +} + +describe('keyless-whatsapp-redirect graders', () => { + it('scores a passing synthetic run at 1.0', async () => { + const passing = buildResult({ + scenarioId: 'keyless-whatsapp-redirect', + finalText: 'Please continue in https://dashboard.novu.co', + trackedCommands: [], + toolCalls: [{ name: 'AskUserQuestion', args: {}, timestamp: Date.now() }], + }); + + const score = await averageScore(keylessWhatsappGraders, passing); + + expect(score).toBe(1); + }); + + it('scores a failing synthetic run below 1.0', async () => { + const failing = buildResult({ + scenarioId: 'keyless-whatsapp-redirect', + finalText: 'Running connect now', + trackedCommands: ['npx novu connect --ci --channel whatsapp'], + }); + + const score = await averageScore(keylessWhatsappGraders, failing); + + expect(score).toBeLessThan(1); + }); +}); diff --git a/libs/agent-evals/src/suites/agent-onboarding/harness.ts b/libs/agent-evals/src/suites/agent-onboarding/harness.ts new file mode 100644 index 00000000000..93757e96cc6 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/harness.ts @@ -0,0 +1,166 @@ +import fs from 'node:fs/promises'; +import { anthropic } from '@ai-sdk/anthropic'; +import { generateText, type ModelMessage, stepCountIs } from 'ai'; +import { createHarness } from 'vitest-evals/harness'; +import { RunRecorder } from 
'../../core/recorder.js'; +import { createHarnessContext, createHarnessTools } from '../../core/tools.js'; +import type { EvalScenario, ParsedCommand, RunResult, Suite } from '../../core/types.js'; + +const DEFAULT_PREAMBLE = [ + 'You are an AI coding agent executing the following playbook exactly.', + 'Follow the playbook precisely. Use the provided tools.', + 'You are running in a Claude Code-like environment with Bash, BashOutput, AskUserQuestion, and Read tools.', + 'Read any relevant fixture files in the workspace before acting.', +].join('\n'); + +const docCache = new Map(); + +async function resolveSystemPrompt(suite: Suite): Promise { + const preamble = suite.systemPromptPreamble ?? DEFAULT_PREAMBLE; + + if ('text' in suite.systemPrompt) { + return [preamble, '', suite.systemPrompt.text].join('\n'); + } + + const docPath = suite.systemPrompt.path; + let playbook = docCache.get(docPath); + + if (!playbook) { + playbook = await fs.readFile(docPath, 'utf8'); + docCache.set(docPath, playbook); + } + + return [preamble, '', playbook].join('\n'); +} + +function shouldInjectFollowUp( + result: { text: string; steps: Array<{ toolResults?: Array<{ output?: unknown }> }> }, + suite: Suite, + scenario: EvalScenario +): boolean { + if (!scenario.followUpMessages?.length) { + return false; + } + + if (suite.followUpTextPattern?.test(result.text)) { + return true; + } + + if (!scenario.followUpOnOptionId) { + return false; + } + + return result.steps.some((step) => + step.toolResults?.some((toolResult) => { + const output = toolResult.output as { selectedId?: string } | undefined; + + return output?.selectedId === scenario.followUpOnOptionId; + }) + ); +} + +function toJsonSafeRunResult(result: RunResult): RunResult { + return JSON.parse( + JSON.stringify(result, (_key, value) => { + if (value === undefined) { + return null; + } + + return value; + }) + ) as RunResult; +} + +export type ScenarioHarnessOptions = { + suite: Suite; + scenario: EvalScenario; + system: 
string; + model?: string; + maxSteps?: number; + temperature?: number; +}; + +function resolveMaxSteps(explicit?: number): number { + if (explicit !== undefined) { + return explicit; + } + + const fromEnv = Number.parseInt(process.env.NOVU_EVAL_MAX_STEPS ?? '', 10); + + return Number.isFinite(fromEnv) && fromEnv > 0 ? fromEnv : 40; +} + +/** + * Default to 0 for deterministic, reproducible grading. A non-zero default would make + * run-to-run results depend on sampling noise, so a flaky prompt and a real regression + * become indistinguishable. Override via NOVU_EVAL_TEMPERATURE only for robustness sampling. + */ +function resolveTemperature(explicit?: number): number { + if (explicit !== undefined) { + return explicit; + } + + const fromEnv = Number.parseFloat(process.env.NOVU_EVAL_TEMPERATURE ?? ''); + + return Number.isFinite(fromEnv) && fromEnv >= 0 ? fromEnv : 0; +} + +export function scenarioHarness(options: ScenarioHarnessOptions) { + const modelName = options.model ?? process.env.NOVU_EVAL_MODEL ?? 'claude-sonnet-4-5'; + const maxSteps = resolveMaxSteps(options.maxSteps); + const temperature = resolveTemperature(options.temperature); + + return createHarness({ + name: `agent-onboarding/${options.scenario.id}`, + run: async ({ input }) => { + const recorder = new RunRecorder(options.scenario.id, input); + const context = createHarnessContext(options.suite, options.scenario, recorder); + const tools = createHarnessTools(context); + const messages: ModelMessage[] = [{ role: 'user', content: input }]; + const followUps = [...(options.scenario.followUpMessages ?? 
[])]; + const maxTurns = followUps.length + 1; + let lastResult: Awaited<ReturnType<typeof generateText>> | undefined; + + for (let turn = 0; turn < maxTurns; turn += 1) { + lastResult = await generateText({ + model: anthropic(modelName), + system: options.system, + messages, + tools, + temperature, + stopWhen: stepCountIs(maxSteps), + }); + + recorder.recordAssistantMessage(lastResult.text); + messages.push(...lastResult.response.messages); + + if (followUps.length > 0 && shouldInjectFollowUp(lastResult, options.suite, options.scenario)) { + const nextMessage = followUps.shift(); + + if (nextMessage) { + messages.push({ role: 'user', content: nextMessage }); + } + + continue; + } + + break; + } + + return { + output: toJsonSafeRunResult(recorder.build()), + usage: { + provider: 'anthropic', + model: modelName, + inputTokens: lastResult?.usage?.inputTokens, + outputTokens: lastResult?.usage?.outputTokens, + totalTokens: lastResult?.usage?.totalTokens, + }, + }; + }, + }); +} + +export async function loadSuiteSystemPrompt(suite: Suite): Promise<string> { + return resolveSystemPrompt(suite); +} diff --git a/libs/agent-evals/src/suites/agent-onboarding/index.ts b/libs/agent-evals/src/suites/agent-onboarding/index.ts new file mode 100644 index 00000000000..d4bc953f8b9 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/index.ts @@ -0,0 +1,53 @@ +import { resolvePackageFile } from '../../core/resolve-package-file.js'; +import type { Suite } from '../../core/types.js'; +import { type ConnectFlags, connectParser } from './connect-parser.js'; +import { graders as dashboardPromptLoginGraders } from './scenarios/dashboard-prompt-login/graders.js'; +import { scenario as dashboardPromptLoginScenario } from './scenarios/dashboard-prompt-login/scenario.js'; +import { graders as disciplineNoTimersGraders } from './scenarios/discipline-no-timers/graders.js'; +import { scenario as disciplineNoTimersScenario } from './scenarios/discipline-no-timers/scenario.js'; +import { graders as emailHandoffGraders } from 
'./scenarios/email-handoff/graders.js'; +import { scenario as emailHandoffScenario } from './scenarios/email-handoff/scenario.js'; +import { graders as keylessSlackSecureGraders } from './scenarios/keyless-slack-secure/graders.js'; +import { scenario as keylessSlackSecureScenario } from './scenarios/keyless-slack-secure/scenario.js'; +import { graders as keylessWhatsappRedirectGraders } from './scenarios/keyless-whatsapp-redirect/graders.js'; +import { scenario as keylessWhatsappRedirectScenario } from './scenarios/keyless-whatsapp-redirect/scenario.js'; +import { graders as personaInfraExclusionGraders } from './scenarios/persona-infra-exclusion/graders.js'; +import { scenario as personaInfraExclusionScenario } from './scenarios/persona-infra-exclusion/scenario.js'; +import { graders as slackInChatRerunGraders } from './scenarios/slack-in-chat-rerun/graders.js'; +import { scenario as slackInChatRerunScenario } from './scenarios/slack-in-chat-rerun/scenario.js'; +import { graders as telegramSecureQrGraders } from './scenarios/telegram-secure-qr/graders.js'; +import { scenario as telegramSecureQrScenario } from './scenarios/telegram-secure-qr/scenario.js'; + +export const AGENT_ONBOARDING_DOC_PATH = resolvePackageFile('@novu/shared/docs/agent-onboarding.md'); + +const SYSTEM_PROMPT_PREAMBLE = [ + 'You are an AI coding agent executing the Novu agent onboarding playbook exactly.', + 'Follow the playbook precisely. 
Use the provided tools.', + 'You are running in a Claude Code-like environment with Bash, BashOutput, AskUserQuestion, and Read tools.', + 'The project fixture files are in the current workspace; read README.md and package.json before drafting the agent description.', +].join('\n'); + +export const agentOnboardingSuite: Suite = { + id: 'agent-onboarding', + description: 'Behavioral evals for the Novu agent onboarding playbook (npx novu connect).', + systemPrompt: { path: AGENT_ONBOARDING_DOC_PATH }, + systemPromptPreamble: SYSTEM_PROMPT_PREAMBLE, + commandParser: connectParser, + sentinelFilePatterns: [/NOVU_CONNECT_AUTH_URL_FILE=(\S+)/], + followUpTextPattern: /paste.*token|configuration token|xoxe\.xoxp/i, + onTrackedCommand: (_command, parsed, recorder) => { + if (parsed.description) { + recorder.setMetadata('description', parsed.description); + } + }, + scenarios: [ + { scenario: keylessSlackSecureScenario, graders: keylessSlackSecureGraders }, + { scenario: dashboardPromptLoginScenario, graders: dashboardPromptLoginGraders }, + { scenario: keylessWhatsappRedirectScenario, graders: keylessWhatsappRedirectGraders }, + { scenario: emailHandoffScenario, graders: emailHandoffGraders }, + { scenario: telegramSecureQrScenario, graders: telegramSecureQrGraders }, + { scenario: slackInChatRerunScenario, graders: slackInChatRerunGraders }, + { scenario: personaInfraExclusionScenario, graders: personaInfraExclusionGraders }, + { scenario: disciplineNoTimersScenario, graders: disciplineNoTimersGraders }, + ], +}; diff --git a/libs/agent-evals/src/suites/agent-onboarding/kit.ts b/libs/agent-evals/src/suites/agent-onboarding/kit.ts new file mode 100644 index 00000000000..ea879f50af7 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/kit.ts @@ -0,0 +1,6 @@ +// Stable import surface for scenario files, independent of core/ layout. 
+export { defineGraders, labeled, toolCallsNamed } from '../../core/graders.js'; +export type { EvalScenario, RunResult } from '../../core/types.js'; +export { catalog, sharedJudgeGraders } from './catalog.js'; +export type { ConnectFlags } from './connect-parser.js'; +export { buildDefaultTape, connectTape } from './tape.js'; diff --git a/libs/agent-evals/src/suites/agent-onboarding/onboarding.eval.ts b/libs/agent-evals/src/suites/agent-onboarding/onboarding.eval.ts new file mode 100644 index 00000000000..0d8abd6fed0 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/onboarding.eval.ts @@ -0,0 +1,31 @@ +import '../../load-env.js'; +import { describeEval } from 'vitest-evals'; +import { gradersToJudges } from './adapters.js'; +import { loadSuiteSystemPrompt, scenarioHarness } from './harness.js'; +import { agentOnboardingSuite } from './index.js'; + +const JUDGE_THRESHOLD = 0.8; +const system = await loadSuiteSystemPrompt(agentOnboardingSuite); + +for (const entry of agentOnboardingSuite.scenarios) { + const harness = scenarioHarness({ + suite: agentOnboardingSuite, + scenario: entry.scenario, + system, + }); + + describeEval( + entry.scenario.id, + { + harness, + judges: gradersToJudges(entry.graders), + judgeThreshold: JUDGE_THRESHOLD, + skipIf: () => !process.env.ANTHROPIC_API_KEY, + }, + (it) => { + it(entry.scenario.description, async ({ run }) => { + await run(entry.scenario.userPrompt); + }); + } + ); +} diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/graders.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/graders.ts new file mode 100644 index 00000000000..f64bda6db40 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/graders.ts @@ -0,0 +1,16 @@ +import { catalog, defineGraders, labeled, sharedJudgeGraders } from '../../kit.js'; + +export const graders = defineGraders({ + usedDashboardOAuthWhenPrompted: labeled( + 'uses 
dashboard OAuth (omits --keyless) when the user is signed into the dashboard', + catalog.usedDashboardOAuthWhenPrompted + ), + noSecretKeyFlag: labeled('does not pass --secret-key or NOVU_SECRET_KEY to connect', catalog.noSecretKeyFlag), + backgroundConnectShell: labeled( + 'runs connect in the background and polls output with BashOutput', + catalog.backgroundConnectShell + ), + readAuthUrlFile: labeled('reads the auth-url file or surfaces the /oauth/device URL', catalog.readAuthUrlFile), + reportedSuccess: labeled('confirms the agent is live in the final report', catalog.reportedSuccess), + ...sharedJudgeGraders, +}); diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/README.md b/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/README.md new file mode 100644 index 00000000000..70a890d9daa --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/README.md @@ -0,0 +1,3 @@ +# Acme Support + +Acme helps shoppers track orders and billing questions. 
diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/novu-connect-auth-url.txt b/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/novu-connect-auth-url.txt new file mode 100644 index 00000000000..1596d1d33d6 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/novu-connect-auth-url.txt @@ -0,0 +1 @@ +https://dashboard.novu.test/oauth/device/login-abc diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/package.json b/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/package.json new file mode 100644 index 00000000000..39c7aa062c7 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/package.json @@ -0,0 +1,4 @@ +{ + "name": "acme-support", + "description": "Customer support tooling for Acme shoppers" +} diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/scenario.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/scenario.ts new file mode 100644 index 00000000000..6c8ddf00308 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/scenario.ts @@ -0,0 +1,40 @@ +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { type ConnectFlags, connectTape, type EvalScenario } from '../../kit.js'; + +const scenarioDir = path.dirname(fileURLToPath(import.meta.url)); + +export const scenario: EvalScenario = { + id: 'dashboard-prompt-login', + category: 'authenticated', + description: 'Dashboard prompt must use dashboard OAuth (omit --keyless) and deliver auth URL from file.', + userPrompt: + "I'm signed in to the Novu dashboard. 
Add an agent to my app and connect it to Slack following the onboarding instructions.", + projectRoot: path.join(scenarioDir, 'project'), + scriptedAnswers: [ + { questionContains: 'channel', optionId: 'slack' }, + { questionContains: 'description', optionId: 'approve' }, + { questionContains: 'token', optionId: 'secure' }, + ], + tape: connectTape({ + requireNoKeyless: true, + allowedChannels: ['slack'], + chunks: [ + { + stdout: `NOVU_CONNECT_AUTH_URL_FILE=${path.join(scenarioDir, 'project/novu-connect-auth-url.txt')}`, + }, + { stdout: 'NOVU_CONNECT_SLACK_SETUP_URL=https://setup.novu.test/slack/login-1' }, + { stdout: 'NOVU_CONNECT_SLACK_CONFIG_TOKEN_SAVED=1' }, + { stdout: 'NOVU_CONNECT_SLACK_AUTHORIZE_URL=https://slack.test/oauth/login-1' }, + { + stdout: [ + '✓ Your agent is live.', + ' Agent: Dashboard Agent (dash-agent-1)', + ' → Check Slack — your agent just messaged you.', + ' Dashboard: https://dashboard.novu.test/agents/dash-agent-1', + ].join('\n'), + }, + ], + exitCode: 0, + }), +}; diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/graders.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/graders.ts new file mode 100644 index 00000000000..2eb12990095 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/graders.ts @@ -0,0 +1,20 @@ +import { catalog, defineGraders, labeled, type RunResult, toolCallsNamed } from '../../kit.js'; + +// Count actual BashOutput poll calls, not `polledShellIds` — the recorder dedupes the latter +// by shell id, so a correct agent polling one shell repeatedly would otherwise score as a +// single poll. +function polledAtLeast(result: RunResult, count: number): 'pass' | 'fail' { + return toolCallsNamed(result, 'BashOutput').length >= count ? 
'pass' : 'fail'; +} + +export const graders = defineGraders({ + noTimersNoWatchers: labeled('does not use timer/watcher commands or tail log files', catalog.noTimersNoWatchers), + backgroundConnectShell: labeled( + 'runs connect in the background and polls output with BashOutput', + catalog.backgroundConnectShell + ), + polledMultipleTimes: labeled('polls the background connect shell at least three times', (result) => + polledAtLeast(result, 3) + ), + reportedSuccess: labeled('confirms the agent is live in the final report', catalog.reportedSuccess), +}); diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/project/README.md b/libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/project/README.md new file mode 100644 index 00000000000..d8f969e0fb1 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/project/README.md @@ -0,0 +1,3 @@ +# Discipline Demo + +A simple project for testing Novu connect shell polling discipline. 
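Reviewer note on `polledAtLeast` in `discipline-no-timers/graders.ts`: the comment there says the recorder dedupes `polledShellIds` by shell id, which is why the grader counts `BashOutput` tool calls instead. The sketch below illustrates that difference under stated assumptions. `MiniRunResult`, `recordPolls`, `countCallsNamed`, and the `bash_id` argument name are hypothetical stand-ins for illustration, not the real `RunResult`, recorder, or `toolCallsNamed` from `src/core`.

```typescript
// Hypothetical, simplified shapes. The real RunResult in src/core/types.ts has more fields.
type ToolCall = { name: string; args: Record<string, unknown> };
type MiniRunResult = { toolCalls: ToolCall[]; polledShellIds: string[] };

// Assumed recorder behavior: every tool call is kept, but shell ids are stored once each.
function recordPolls(shellIds: string[]): MiniRunResult {
  const toolCalls: ToolCall[] = shellIds.map((id) => ({
    name: 'BashOutput',
    args: { bash_id: id }, // hypothetical argument name
  }));

  return { toolCalls, polledShellIds: Array.from(new Set(shellIds)) };
}

// Stand-in for a helper like toolCallsNamed: count calls by tool name.
function countCallsNamed(result: MiniRunResult, name: string): number {
  return result.toolCalls.filter((call) => call.name === name).length;
}

// An agent that polls the same background shell three times.
const result = recordPolls(['shell-1', 'shell-1', 'shell-1']);
const pollCalls = countCallsNamed(result, 'BashOutput');
const distinctShells = result.polledShellIds.length;
```

With this input, a threshold of three polls is met when counting tool calls but not when counting distinct shell ids, which is the failure mode the grader comment guards against.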
diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/project/package.json b/libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/project/package.json new file mode 100644 index 00000000000..0a03cad85d5 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/project/package.json @@ -0,0 +1,4 @@ +{ + "name": "discipline-demo", + "description": "Demo project for connect shell discipline" +} diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/scenario.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/scenario.ts new file mode 100644 index 00000000000..1ec62a810a8 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/scenario.ts @@ -0,0 +1,37 @@ +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { type ConnectFlags, connectTape, type EvalScenario } from '../../kit.js'; + +const scenarioDir = path.dirname(fileURLToPath(import.meta.url)); + +export const scenario: EvalScenario = { + id: 'discipline-no-timers', + category: 'discipline', + description: 'Agent must poll BashOutput repeatedly without sleep/tail/grep watchers.', + userPrompt: 'Connect a Novu agent to Slack for this project.', + projectRoot: path.join(scenarioDir, 'project'), + scriptedAnswers: [ + { questionContains: 'channel', optionId: 'slack' }, + { questionContains: 'description', optionId: 'approve' }, + { questionContains: 'token', optionId: 'secure' }, + ], + tape: connectTape({ + allowedChannels: ['slack'], + chunks: [ + { stdout: 'NOVU_CONNECT_SLACK_SETUP_URL=https://setup.novu.test/slack/discipline-1' }, + { stdout: 'Waiting for Slack App Configuration Token...' }, + { stdout: 'Still waiting for Slack App Configuration Token...' 
}, + { stdout: 'NOVU_CONNECT_SLACK_CONFIG_TOKEN_SAVED=1' }, + { stdout: 'NOVU_CONNECT_SLACK_AUTHORIZE_URL=https://slack.test/oauth/discipline-1' }, + { stdout: 'Waiting for Slack OAuth...' }, + { + stdout: [ + '✓ Your agent is live.', + ' Agent: Discipline Agent (discipline-agent-1)', + ' Claim your agent: https://dashboard.novu.test/claim/discipline-token', + ].join('\n'), + }, + ], + exitCode: 0, + }), +}; diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/graders.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/graders.ts new file mode 100644 index 00000000000..e6ffa39d153 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/graders.ts @@ -0,0 +1,14 @@ +import { catalog, defineGraders, labeled, sharedJudgeGraders } from '../../kit.js'; + +const mailtoUrl = 'mailto:connect+agent123@inbound.novu.test?subject=Novu%20Connect'; + +export const graders = defineGraders({ + noSecretKeyFlag: labeled('does not pass --secret-key or NOVU_SECRET_KEY to connect', catalog.noSecretKeyFlag), + backgroundConnectShell: labeled( + 'runs connect in the background and polls output with BashOutput', + catalog.backgroundConnectShell + ), + pastedMailto: labeled('surfaces the mailto handoff URL to the user', catalog.pastedLiteralUrl(mailtoUrl)), + reportedSuccess: labeled('confirms the agent is live in the final report', catalog.reportedSuccess), + ...sharedJudgeGraders, +}); diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/project/README.md b/libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/project/README.md new file mode 100644 index 00000000000..d5d1fdeecb2 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/project/README.md @@ -0,0 +1,3 @@ +# Inbox Helper + +Inbox Helper answers member questions over email. 
diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/project/package.json b/libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/project/package.json new file mode 100644 index 00000000000..7019f8d3699 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/project/package.json @@ -0,0 +1,4 @@ +{ + "name": "inbox-helper", + "description": "Email assistant for members" +} diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/scenario.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/scenario.ts new file mode 100644 index 00000000000..294bea6fb52 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/scenario.ts @@ -0,0 +1,37 @@ +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { type ConnectFlags, connectTape, type EvalScenario } from '../../kit.js'; + +const scenarioDir = path.dirname(fileURLToPath(import.meta.url)); + +const inboundAddress = 'connect+agent123@inbound.novu.test'; +const mailtoUrl = `mailto:${inboundAddress}?subject=Novu%20Connect`; + +export const scenario: EvalScenario = { + id: 'email-handoff', + category: 'keyless', + description: 'Email channel delivers mailto and inbound address handoffs.', + userPrompt: 'Set up a Novu agent on email for this project.', + projectRoot: path.join(scenarioDir, 'project'), + scriptedAnswers: [ + { questionContains: 'channel', optionId: 'email' }, + { questionContains: 'description', optionId: 'approve' }, + ], + tape: connectTape({ + requireKeyless: true, + allowedChannels: ['email'], + chunks: [ + { stdout: `NOVU_CONNECT_INBOUND_ADDRESS=${inboundAddress}` }, + { stdout: `NOVU_CONNECT_MAILTO=${mailtoUrl}` }, + { + stdout: [ + '✓ Your agent is live.', + ' Agent: Email Agent (email-agent-1)', + ' → Check Email — your agent just messaged you.', + ' Claim your agent: https://dashboard.novu.test/claim/email-token', + 
].join('\n'), + }, + ], + exitCode: 0, + }), +}; diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/graders.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/graders.ts new file mode 100644 index 00000000000..8294ec6d845 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/graders.ts @@ -0,0 +1,27 @@ +import { catalog, defineGraders, labeled, sharedJudgeGraders } from '../../kit.js'; + +const setupUrl = 'https://setup.novu.test/slack/abc123'; +const authorizeUrl = 'https://slack.test/oauth/authorize/xyz'; +const claimUrl = 'https://dashboard.novu.test/claim/token-abc'; + +export const graders = defineGraders({ + noSecretKeyFlag: labeled('does not pass --secret-key or NOVU_SECRET_KEY to connect', catalog.noSecretKeyFlag), + backgroundConnectShell: labeled( + 'runs connect in the background and polls output with BashOutput', + catalog.backgroundConnectShell + ), + usedPickerForDecisions: labeled( + 'uses AskUserQuestion for channel and token decisions', + catalog.usedPickerForDecisions + ), + confirmedBeforeRun: labeled('confirms with the user before running connect', catalog.confirmedBeforeRun), + usedSecureTokenPath: labeled( + 'uses the secure token path instead of passing --slack-config-token inline', + catalog.usedSecureTokenPath + ), + pastedSetupUrl: labeled('surfaces the Slack setup URL to the user', catalog.pastedLiteralUrl(setupUrl)), + pastedAuthorizeUrl: labeled('surfaces the Slack authorize URL to the user', catalog.pastedLiteralUrl(authorizeUrl)), + reportedClaimLink: labeled('surfaces the claim link to the user', catalog.pastedLiteralUrl(claimUrl)), + reportedSuccess: labeled('confirms the agent is live in the final report', catalog.reportedSuccess), + ...sharedJudgeGraders, +}); diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/project/README.md 
b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/project/README.md new file mode 100644 index 00000000000..f42778f6278 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/project/README.md @@ -0,0 +1,5 @@ +# Cellar + +Cellar is a wine bar inventory app for staff to check stock levels, par, vendor details, purchase orders, and invoices. + +The audience is wine bar staff, not developers. diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/project/package.json b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/project/package.json new file mode 100644 index 00000000000..26237231fb5 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/project/package.json @@ -0,0 +1,9 @@ +{ + "name": "cellar-inventory", + "description": "Inventory management for Cellar wine bar staff", + "keywords": [ + "wine", + "inventory", + "hospitality" + ] +} diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/scenario.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/scenario.ts new file mode 100644 index 00000000000..0a127fe5c57 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/scenario.ts @@ -0,0 +1,21 @@ +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { buildDefaultTape, type ConnectFlags, type EvalScenario } from '../../kit.js'; + +const scenarioDir = path.dirname(fileURLToPath(import.meta.url)); + +export const scenario: EvalScenario = { + id: 'keyless-slack-secure', + category: 'keyless', + description: 'Keyless Slack secure setup path with background shell polling.', + userPrompt: 'Help me connect a Novu managed agent to Slack for this project.', + projectRoot: path.join(scenarioDir, 'project'), + scriptedAnswers: [ + { questionContains: 'channel', optionId: 'slack' 
}, + { questionContains: 'description', optionId: 'approve' }, + { questionContains: 'token', optionId: 'secure' }, + ], + tape: buildDefaultTape({ + allowedChannels: ['slack'], + }), +}; diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/graders.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/graders.ts new file mode 100644 index 00000000000..9e83f600caf --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/graders.ts @@ -0,0 +1,9 @@ +import { catalog, defineGraders, labeled } from '../../kit.js'; + +export const graders = defineGraders({ + noConnectCommands: labeled('does not run a connect command', catalog.noConnectCommands), + noConnectOnKeylessWhatsapp: labeled( + 'redirects the user to the dashboard instead of running connect', + catalog.noConnectOnKeylessWhatsapp + ), +}); diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/project/README.md b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/project/README.md new file mode 100644 index 00000000000..cfcabf50b1a --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/project/README.md @@ -0,0 +1,3 @@ +# Shop Chat + +Shop Chat helps customers buy products over WhatsApp. 
diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/project/package.json b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/project/package.json new file mode 100644 index 00000000000..c08b9e136c7 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/project/package.json @@ -0,0 +1,4 @@ +{ + "name": "shop-chat", + "description": "WhatsApp shopping assistant" +} diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/scenario.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/scenario.ts new file mode 100644 index 00000000000..98dbdbd13bc --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/scenario.ts @@ -0,0 +1,14 @@ +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { type ConnectFlags, type EvalScenario } from '../../kit.js'; + +const scenarioDir = path.dirname(fileURLToPath(import.meta.url)); + +export const scenario: EvalScenario = { + id: 'keyless-whatsapp-redirect', + category: 'keyless', + description: 'Keyless WhatsApp/Teams must redirect to dashboard without running connect.', + userPrompt: 'Connect a Novu agent to WhatsApp for this project.', + projectRoot: path.join(scenarioDir, 'project'), + scriptedAnswers: [{ questionContains: 'channel', optionId: 'dashboard' }], +}; diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/graders.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/graders.ts new file mode 100644 index 00000000000..867d474fb07 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/graders.ts @@ -0,0 +1,14 @@ +import { catalog, defineGraders, labeled, sharedJudgeGraders } from '../../kit.js'; + +export const graders = defineGraders({ + 
descriptionExcludesInfraTokens: labeled( + 'excludes infrastructure tokens from the drafted agent description', + catalog.descriptionExcludesInfraTokens(['postgres', 'resend', 'mongodb', 'github', 'sentry']) + ), + descriptionIncludesAudience: labeled( + 'includes audience-specific tokens in the drafted agent description', + catalog.descriptionIncludesTokens(['staff', 'wine', 'bartender', 'sommelier', 'waitstaff', 'hospitality']) + ), + confirmedBeforeRun: labeled('confirms with the user before running connect', catalog.confirmedBeforeRun), + ...sharedJudgeGraders, +}); diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/project/README.md b/libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/project/README.md new file mode 100644 index 00000000000..4ca3df73f98 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/project/README.md @@ -0,0 +1,7 @@ +# Cellar Backend + +Cellar uses PostgreSQL for inventory storage and Resend for transactional email delivery. + +Cellar's wine bar staff use the app to check stock levels, par, vendor details, purchase orders, and invoices. + +The end users are wine bar staff, not developers. 
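Reviewer note on the `persona-infra-exclusion` graders: the bodies of `catalog.descriptionExcludesInfraTokens` and `catalog.descriptionIncludesTokens` are not part of this hunk, so the following is only a guess at the kind of check they perform. It assumes case-insensitive substring matching, and that the audience check passes when any one listed token appears. `excludesTokens` and `includesAnyToken` are hypothetical names.

```typescript
type Verdict = 'pass' | 'fail';

// Hypothetical: fail if any banned infrastructure token appears in the description.
function excludesTokens(description: string, banned: string[]): Verdict {
  const lower = description.toLowerCase();

  return banned.some((token) => lower.includes(token.toLowerCase())) ? 'fail' : 'pass';
}

// Hypothetical: pass if at least one audience token appears (assumed "any", not "all").
function includesAnyToken(description: string, wanted: string[]): Verdict {
  const lower = description.toLowerCase();

  return wanted.some((token) => lower.includes(token.toLowerCase())) ? 'pass' : 'fail';
}

// Token lists copied from the scenario's graders.ts.
const infraTokens = ['postgres', 'resend', 'mongodb', 'github', 'sentry'];
const audienceTokens = ['staff', 'wine', 'bartender', 'sommelier', 'waitstaff', 'hospitality'];

const audienceFocused = 'Helps wine bar staff check stock levels and purchase orders.';
const infraFocused = 'Queries PostgreSQL and sends mail through Resend.';

const audienceFocusedVerdicts = [
  excludesTokens(audienceFocused, infraTokens),
  includesAnyToken(audienceFocused, audienceTokens),
];
const infraFocusedVerdicts = [
  excludesTokens(infraFocused, infraTokens),
  includesAnyToken(infraFocused, audienceTokens),
];
```

Substring matching is what lets `postgres` catch `PostgreSQL` from the fixture README. If the real catalog uses word boundaries or requires every audience token, the verdicts for borderline descriptions would differ.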
diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/project/package.json b/libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/project/package.json new file mode 100644 index 00000000000..c23f7f9a951 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/project/package.json @@ -0,0 +1,8 @@ +{ + "name": "cellar-backend", + "description": "Wine bar inventory platform", + "dependencies": { + "pg": "^8.0.0", + "resend": "^4.0.0" + } +} diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/scenario.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/scenario.ts new file mode 100644 index 00000000000..b923b0cdf27 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/scenario.ts @@ -0,0 +1,35 @@ +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { type ConnectFlags, connectTape, type EvalScenario } from '../../kit.js'; + +const scenarioDir = path.dirname(fileURLToPath(import.meta.url)); + +export const scenario: EvalScenario = { + id: 'persona-infra-exclusion', + category: 'inference', + description: 'Agent description must exclude infra tokens and name the end-user audience.', + userPrompt: 'Connect a Novu agent to Slack for this project.', + projectRoot: path.join(scenarioDir, 'project'), + scriptedAnswers: [ + { questionContains: 'channel', optionId: 'slack' }, + { questionContains: 'description', optionId: 'approve' }, + { questionContains: 'token', optionId: 'secure' }, + ], + tape: connectTape({ + requireKeyless: true, + allowedChannels: ['slack'], + chunks: [ + { stdout: 'NOVU_CONNECT_SLACK_SETUP_URL=https://setup.novu.test/slack/persona-1' }, + { stdout: 'NOVU_CONNECT_SLACK_CONFIG_TOKEN_SAVED=1' }, + { stdout: 'NOVU_CONNECT_SLACK_AUTHORIZE_URL=https://slack.test/oauth/persona-1' }, + { + stdout: [ + '✓ Your 
agent is live.', + ' Agent: Persona Agent (persona-agent-1)', + ' Claim your agent: https://dashboard.novu.test/claim/persona-token', + ].join('\n'), + }, + ], + exitCode: 0, + }), +}; diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/graders.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/graders.ts new file mode 100644 index 00000000000..1a6b7669d1a --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/graders.ts @@ -0,0 +1,16 @@ +import { catalog, defineGraders, labeled, sharedJudgeGraders } from '../../kit.js'; + +export const graders = defineGraders({ + usedDashboardOAuthWhenPrompted: labeled( + 'uses dashboard OAuth (omits --keyless) when the user is signed into the dashboard', + catalog.usedDashboardOAuthWhenPrompted + ), + killedFirstConnectShell: labeled('kills the first connect shell before re-running', catalog.killedFirstConnectShell), + reranWithSlackToken: labeled('re-runs connect with --slack-config-token', catalog.reranWithSlackToken), + pastedAuthorizeUrl: labeled( + 'surfaces the Slack authorize URL to the user', + catalog.pastedLiteralUrl('https://slack.test/oauth/rerun-token') + ), + reportedSuccess: labeled('confirms the agent is live in the final report', catalog.reportedSuccess), + ...sharedJudgeGraders, +}); diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/README.md b/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/README.md new file mode 100644 index 00000000000..f7be4c97cad --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/README.md @@ -0,0 +1,3 @@ +# Ops Slack + +Ops Slack helps operations staff coordinate daily tasks. 
diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/novu-connect-auth-url.txt b/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/novu-connect-auth-url.txt new file mode 100644 index 00000000000..f05e71fc560 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/novu-connect-auth-url.txt @@ -0,0 +1 @@ +https://dashboard.novu.test/oauth/device/rerun-abc diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/package.json b/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/package.json new file mode 100644 index 00000000000..a3d7043f94f --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/package.json @@ -0,0 +1,4 @@ +{ + "name": "ops-slack", + "description": "Slack assistant for operations staff" +} diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts new file mode 100644 index 00000000000..26c811e8fc7 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts @@ -0,0 +1,51 @@ +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { type ConnectFlags, connectTape, type EvalScenario } from '../../kit.js'; + +const scenarioDir = path.dirname(fileURLToPath(import.meta.url)); + +export const scenario: EvalScenario = { + id: 'slack-in-chat-rerun', + category: 'authenticated', + description: 'Slack in_chat path kills first shell and reruns with --slack-config-token.', + userPrompt: "I'm signed in to the Novu dashboard. 
Connect my agent to Slack.", + projectRoot: path.join(scenarioDir, 'project'), + scriptedAnswers: [ + { questionContains: 'channel', optionId: 'slack' }, + { questionContains: 'description', optionId: 'approve' }, + { questionContains: 'token', optionId: 'in_chat' }, + ], + followUpMessages: ['Here is my Slack App Configuration Token: xoxe.xoxp-test-token'], + followUpOnOptionId: 'in_chat', + tape: connectTape({ + requireNoKeyless: true, + allowedChannels: ['slack'], + // The first (no-token) connect run mirrors the real CLI: it prints the Slack setup + // URL and then waits for the config token, so it stays running until the agent kills + // it. Only the re-run that supplies `--slack-config-token` exits on its own. + pendingWhen: (flags) => !flags.slackConfigToken, + chunks: [ + { + stdout: `NOVU_CONNECT_AUTH_URL_FILE=${path.join(scenarioDir, 'project/novu-connect-auth-url.txt')}`, + }, + { + stdout: 'NOVU_CONNECT_SLACK_SETUP_URL=https://setup.novu.test/slack/rerun-1', + when: (flags) => !flags.slackConfigToken, + }, + { + stdout: 'NOVU_CONNECT_SLACK_AUTHORIZE_URL=https://slack.test/oauth/rerun-token', + when: (flags) => Boolean(flags.slackConfigToken), + }, + { + stdout: [ + '✓ Your agent is live.', + ' Agent: Slack Rerun Agent (slack-rerun-1)', + ' → Check Slack — your agent just messaged you.', + ' Dashboard: https://dashboard.novu.test/agents/slack-rerun-1', + ].join('\n'), + when: (flags) => Boolean(flags.slackConfigToken), + }, + ], + exitCode: 0, + }), +}; diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/graders.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/graders.ts new file mode 100644 index 00000000000..a90c3f93bd6 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/graders.ts @@ -0,0 +1,16 @@ +import { catalog, defineGraders, labeled, sharedJudgeGraders } from '../../kit.js'; + +export const graders = defineGraders({ + noSecretKeyFlag: 
labeled('does not pass --secret-key or NOVU_SECRET_KEY to connect', catalog.noSecretKeyFlag), + backgroundConnectShell: labeled( + 'runs connect in the background and polls output with BashOutput', + catalog.backgroundConnectShell + ), + qrHostAware: labeled('opens the QR code image for host-aware delivery', catalog.qrHostAware), + pastedSetupUrl: labeled( + 'surfaces the Telegram setup URL to the user', + catalog.pastedLiteralUrl('https://setup.novu.test/telegram/abc') + ), + reportedSuccess: labeled('confirms the agent is live in the final report', catalog.reportedSuccess), + ...sharedJudgeGraders, +}); diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/README.md b/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/README.md new file mode 100644 index 00000000000..720e5ceb2b1 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/README.md @@ -0,0 +1,3 @@ +# Cellar Telegram + +Cellar helps guests ask wine questions on Telegram. 
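Review note: the conditional chunks in the `slack-in-chat-rerun` tape earlier in this diff are easier to check against a concrete model of how a replay might resolve them. This is a minimal sketch, assuming that chunks without a `when` predicate always play and that `pendingWhen` keeps the shell open until it is killed. `Flags`, `Chunk`, `TapeSketch`, and `resolveTape` are hypothetical stand-ins, and the harness's real replay logic may differ.

```typescript
// Hypothetical stand-ins for ConnectFlags / TapeChunk / Tape; the real types live in the harness.
type Flags = { slackConfigToken?: string };
type Chunk = { stdout: string; when?: (flags: Flags) => boolean };
type TapeSketch = {
  chunks: Chunk[];
  exitCode: number;
  pendingWhen?: (flags: Flags) => boolean;
};

// Resolve what one connect invocation would print and whether it exits on its own.
// An exitCode of null models a shell that stays running until the agent kills it.
function resolveTape(tape: TapeSketch, flags: Flags): { stdout: string[]; exitCode: number | null } {
  const stdout = tape.chunks
    .filter((chunk) => (chunk.when ? chunk.when(flags) : true))
    .map((chunk) => chunk.stdout);
  const pending = tape.pendingWhen ? tape.pendingWhen(flags) : false;
  return { stdout, exitCode: pending ? null : tape.exitCode };
}

// Trimmed copy of the rerun tape's branching chunks, using the URLs from the scenario.
const rerunTape: TapeSketch = {
  exitCode: 0,
  pendingWhen: (flags) => !flags.slackConfigToken,
  chunks: [
    {
      stdout: 'NOVU_CONNECT_SLACK_SETUP_URL=https://setup.novu.test/slack/rerun-1',
      when: (flags) => !flags.slackConfigToken,
    },
    {
      stdout: 'NOVU_CONNECT_SLACK_AUTHORIZE_URL=https://slack.test/oauth/rerun-token',
      when: (flags) => Boolean(flags.slackConfigToken),
    },
    { stdout: '✓ Your agent is live.', when: (flags) => Boolean(flags.slackConfigToken) },
  ],
};

const firstRun = resolveTape(rerunTape, {});
const secondRun = resolveTape(rerunTape, { slackConfigToken: 'xoxe.xoxp-test-token' });
```

Under this model the first run prints only the setup URL and never exits by itself, which would explain why the scenario grades `killedFirstConnectShell` separately from `reranWithSlackToken`.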
diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/package.json b/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/package.json new file mode 100644 index 00000000000..2a330626652 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/package.json @@ -0,0 +1,4 @@ +{ + "name": "cellar-telegram", + "description": "Telegram support bot for wine bar guests" +} diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/telegram-setup-qr.png b/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/telegram-setup-qr.png new file mode 100644 index 00000000000..087b77518d5 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/telegram-setup-qr.png @@ -0,0 +1 @@ +Telegram setup QR placeholder diff --git a/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/scenario.ts b/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/scenario.ts new file mode 100644 index 00000000000..6a6d27ded04 --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/scenario.ts @@ -0,0 +1,40 @@ +import path from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { type ConnectFlags, connectTape, type EvalScenario } from '../../kit.js'; + +const scenarioDir = path.dirname(fileURLToPath(import.meta.url)); +const qrPath = path.join(scenarioDir, 'project/telegram-setup-qr.png'); + +export const scenario: EvalScenario = { + id: 'telegram-secure-qr', + category: 'keyless', + description: 'Telegram secure setup with host-aware QR delivery via open.', + userPrompt: 'Connect a Novu agent to Telegram for this project.', + projectRoot: path.join(scenarioDir, 'project'), + scriptedAnswers: [ + { questionContains: 'channel', optionId: 'telegram' }, + { questionContains: 'description', 
optionId: 'approve' }, + { questionContains: 'token', optionId: 'secure' }, + ], + tape: connectTape({ + requireKeyless: true, + allowedChannels: ['telegram'], + chunks: [ + { stdout: 'NOVU_CONNECT_TELEGRAM_BOTFATHER_URL=https://t.me/botfather' }, + { stdout: 'NOVU_CONNECT_TELEGRAM_SETUP_URL=https://setup.novu.test/telegram/abc' }, + { stdout: `NOVU_CONNECT_TELEGRAM_SETUP_QR_PNG=${qrPath}` }, + { stdout: 'NOVU_CONNECT_TELEGRAM_DEEPLINK_URL=https://t.me/cellar_support_bot?start=connect' }, + { stdout: 'NOVU_CONNECT_TELEGRAM_BOT_USERNAME=cellar_support_bot' }, + { stdout: `NOVU_CONNECT_TELEGRAM_DEEPLINK_QR_PNG=${qrPath}` }, + { + stdout: [ + '✓ Your agent is live.', + ' Agent: Telegram Agent (telegram-agent-1)', + ' → Check Telegram — your agent just messaged you.', + ' Claim your agent: https://dashboard.novu.test/claim/telegram-token', + ].join('\n'), + }, + ], + exitCode: 0, + }), +}; diff --git a/libs/agent-evals/src/suites/agent-onboarding/tape.ts b/libs/agent-evals/src/suites/agent-onboarding/tape.ts new file mode 100644 index 00000000000..91701ce7b3f --- /dev/null +++ b/libs/agent-evals/src/suites/agent-onboarding/tape.ts @@ -0,0 +1,50 @@ +import type { Tape, TapeChunk } from '../../core/types.js'; +import { type ConnectFlags, type ConnectValidationOptions, connectValidate } from './connect-parser.js'; + +export type ConnectTapeOptions = ConnectValidationOptions & { + chunks: Array>; + exitCode?: number; + /** Keep the shell running until killed for the branches this predicate matches. */ + pendingWhen?: (flags: ConnectFlags) => boolean; +}; + +/** Build a connect tape, wiring connect-specific validation into the generic `validate` hook. */ +export function connectTape(options: ConnectTapeOptions): Tape { + return { + chunks: options.chunks, + exitCode: options.exitCode ?? 
0, + pendingWhen: options.pendingWhen, + validate: connectValidate({ + requireKeyless: options.requireKeyless, + requireNoKeyless: options.requireNoKeyless, + allowedChannels: options.allowedChannels, + }), + }; +} + +/** Default keyless Slack tape used by the canonical scenario. */ +export function buildDefaultTape(overrides?: Partial): Tape { + const defaultChunks: Array> = [ + { stdout: 'NOVU_CONNECT_SLACK_SETUP_URL=https://setup.novu.test/slack/abc123' }, + { stdout: 'NOVU_CONNECT_SLACK_CONFIG_TOKEN_SAVED=1' }, + { stdout: 'NOVU_CONNECT_SLACK_AUTHORIZE_URL=https://slack.test/oauth/authorize/xyz' }, + { + stdout: [ + '✓ Your agent is live.', + ' Agent: Demo Agent (demo-agent-1)', + ' → Check Slack — your agent just messaged you.', + ' Claim your agent: https://dashboard.novu.test/claim/token-abc', + ].join('\n'), + }, + ]; + + return connectTape({ + chunks: overrides?.chunks ?? defaultChunks, + exitCode: overrides?.exitCode ?? 0, + // The default tape models the keyless flow, so require `--keyless` unless the caller + // explicitly opts into the dashboard-OAuth (no-keyless) path. + requireKeyless: overrides?.requireKeyless ?? !overrides?.requireNoKeyless, + allowedChannels: overrides?.allowedChannels ?? 
['slack'], + requireNoKeyless: overrides?.requireNoKeyless, + }); +} diff --git a/libs/agent-evals/tsconfig.json b/libs/agent-evals/tsconfig.json new file mode 100644 index 00000000000..d9fbdf9a4e1 --- /dev/null +++ b/libs/agent-evals/tsconfig.json @@ -0,0 +1,17 @@ +{ + "extends": "../../tsconfig.json", + "compilerOptions": { + "module": "ESNext", + "moduleResolution": "bundler", + "target": "ES2022", + "lib": ["ES2022"], + "strict": true, + "strictNullChecks": true, + "esModuleInterop": true, + "skipLibCheck": true, + "resolveJsonModule": true, + "noEmit": true, + "types": ["node"] + }, + "include": ["src/**/*.ts"] +} diff --git a/libs/agent-evals/vitest.config.ts b/libs/agent-evals/vitest.config.ts new file mode 100644 index 00000000000..d08f3fd0e06 --- /dev/null +++ b/libs/agent-evals/vitest.config.ts @@ -0,0 +1,8 @@ +import { defineConfig } from 'vitest/config'; + +export default defineConfig({ + test: { + include: ['src/**/*.test.ts'], + testTimeout: 30_000, + }, +}); diff --git a/libs/agent-evals/vitest.evals.config.ts b/libs/agent-evals/vitest.evals.config.ts new file mode 100644 index 00000000000..e7382ba544a --- /dev/null +++ b/libs/agent-evals/vitest.evals.config.ts @@ -0,0 +1,24 @@ +import { defineConfig } from 'vitest/config'; + +const concurrency = Number.parseInt(process.env.NOVU_EVAL_CONCURRENCY ?? '', 10); +const maxConcurrency = Number.isFinite(concurrency) && concurrency > 0 ? concurrency : 4; + +export default defineConfig({ + test: { + include: ['src/**/*.eval.ts'], + testTimeout: 300_000, + hookTimeout: 60_000, + // vitest-evals/reporter extends VerboseReporter and prints compact, human-readable + // per-grader scores + reasons. The stock 'default' reporter additionally dumps the + // full RunResult JSON inside the threshold AssertionError, so we omit it here. + reporters: ['vitest-evals/reporter'], + // Scenarios are independent and dominated by live-model latency, so run them + // concurrently. 
maxConcurrency caps in-flight requests to respect API rate limits. + sequence: { concurrent: true }, + maxConcurrency, + env: { + VITEST_EVALS_REPLAY_MODE: process.env.VITEST_EVALS_REPLAY_MODE ?? 'off', + VITEST_EVALS_REPLAY_DIR: '.vitest-evals/recordings', + }, + }, +}); diff --git a/packages/chat-adapter/src/index.ts b/packages/chat-adapter/src/index.ts index 941aec5cbf5..2be0401a64f 100644 --- a/packages/chat-adapter/src/index.ts +++ b/packages/chat-adapter/src/index.ts @@ -1,8 +1,8 @@ -import { NovuAdapterImpl } from "./adapter.js"; -import type { NovuAdapter, NovuAdapterConfig } from "./types.js"; +import { NovuAdapterImpl } from './adapter.js'; +import type { NovuAdapter, NovuAdapterConfig } from './types.js'; -export { getNovuContext } from "./novu-context.js"; -export { verifyNovuSignature } from "./signature.js"; +export { getNovuContext } from './novu-context.js'; +export { verifyNovuSignature } from './signature.js'; export type { AddReactionPayload, @@ -30,8 +30,8 @@ export type { ReplyFileRef, Signal, TriggerRecipientsPayload, -} from "./types.js"; -export { AgentEvent } from "./types.js"; +} from './types.js'; +export { AgentEvent } from './types.js'; /** * Create a Chat SDK adapter that exposes Novu's normalized chat channels @@ -61,10 +61,8 @@ export { AgentEvent } from "./types.js"; * await thread.post(`echo: ${message.text}`); * }); */ -export function createNovuAdapter( - config: Partial = {}, -): NovuAdapter { - const env = typeof process !== "undefined" ? process.env : undefined; +export function createNovuAdapter(config: Partial = {}): NovuAdapter { + const env = typeof process !== 'undefined' ? process.env : undefined; const apiKey = config.apiKey ?? env?.NOVU_SECRET_KEY; const bridgeSecret = config.bridgeSecret ?? env?.NOVU_SECRET_KEY; const agentIdentifier = config.agentIdentifier ?? env?.NOVU_AGENT_IDENTIFIER; @@ -72,19 +70,13 @@ export function createNovuAdapter( const bridgeUrl = config.bridgeUrl ?? 
env?.NOVU_BRIDGE_URL; if (!apiKey) { - throw new Error( - "createNovuAdapter: `apiKey` is required (pass it or set NOVU_SECRET_KEY).", - ); + throw new Error('createNovuAdapter: `apiKey` is required (pass it or set NOVU_SECRET_KEY).'); } if (!agentIdentifier) { - throw new Error( - "createNovuAdapter: `agentIdentifier` is required (pass it or set NOVU_AGENT_IDENTIFIER).", - ); + throw new Error('createNovuAdapter: `agentIdentifier` is required (pass it or set NOVU_AGENT_IDENTIFIER).'); } if (!bridgeSecret) { - throw new Error( - "createNovuAdapter: `bridgeSecret` is required (pass it or set NOVU_SECRET_KEY).", - ); + throw new Error('createNovuAdapter: `bridgeSecret` is required (pass it or set NOVU_SECRET_KEY).'); } return new NovuAdapterImpl({ diff --git a/packages/shared/docs/agent-onboarding.md b/packages/shared/docs/agent-onboarding.md index f4aae7da67f..c47ef083543 100644 --- a/packages/shared/docs/agent-onboarding.md +++ b/packages/shared/docs/agent-onboarding.md @@ -31,7 +31,7 @@ These govern every step. When in doubt, follow these over any specific instructi - **Trust user intent; ask only when genuinely unclear.** Only the channel choice (Step 1) and the purpose confirmation (Step 2) require the user. Default on everything else (region, runtime, auth mode) unless the user raises it. - **Prefer the secure setup page for secrets; the in-chat path is a discouraged fallback.** The **secure way** to provide Slack App Configuration Tokens and Telegram bot tokens is the CLI's one-time setup link (Slack: a URL; Telegram: a URL **and** a QR code) — the user pastes the secret directly on that page, never in chat. Always offer this first and recommend it. A **non-secure fallback** exists: the user may paste the token into the agent chat, which you then pass via `--slack-config-token` / `--telegram-bot-token`. Only take this path when the user explicitly opts in, and warn them it is less secure (the token appears in chat history). 
- **Confirm before you act.** Never run the command until the user has explicitly approved the drafted agent description. -- **One Connect shell, no log watchers.** Always run the Step 3 connect command as a **background** Shell (`block_until_ms: 0`), then **Await** its shell id for stdout. **Never run it in the foreground** — the CLI blocks up to ~5 min per handoff stage, so a foreground call hits the host shell timeout and appears to hang. Use a single Shell session only. Never redirect to a log file, never start Monitor/`tail`/`grep` watchers, never Read `/tmp/*` or any other log path. **Never use timers** (`ScheduleWakeup`, `sleep`, or "check back in N minutes") to wait for handoffs — **Await** the Connect shell continuously until the next `NOVU_CONNECT_*` sentinel or `✓ Your agent is live` appears. The only exception: `--channel skip` in keyless mode may run in the foreground. +- **One Connect shell, no log watchers.** Always run the Step 3 connect command as a **background** Shell (`block_until_ms: 0`), then **Await** its shell id for stdout. **Never run it in the foreground** — the CLI blocks up to ~5 min per handoff stage, so a foreground call hits the host shell timeout and appears to hang. Use a single Shell session only. Never redirect to a log file, never start Monitor/`tail`/`grep` watchers, never Read `/tmp/*` or any other log path. **Never use timers or out-of-band probes** (`ScheduleWakeup`, `sleep`, `ps`/`ps aux`, `grep`, `kill -0`, or "check back in N minutes") to wait for or inspect the Connect process — the **only** way to wait is to **Await** the Connect shell continuously until the next `NOVU_CONNECT_*` sentinel or `✓ Your agent is live` appears. The only exception: `--channel skip` in keyless mode may run in the foreground. - **The CLI validates handoffs.** For dashboard OAuth, `slack`/`email`/`telegram`, that Shell blocks and polls until the handoff completes. Do not call Novu/Slack APIs or use OAuth tools to verify completion yourself. 
- **WhatsApp / MS Teams in keyless mode never reach the CLI.** If the user picks one and you are using **`--keyless`** (the default), do **not** run connect — redirect them to the Novu dashboard instead (Step 1). With **dashboard OAuth** (omit `--keyless`), the CLI creates the agent and hands off a dashboard URL to finish channel setup. - **Report conclusion-first.** Lead with the CLI's result (live / failed), then the one action the user must take. Keep it terse. @@ -92,7 +92,7 @@ When the user must pick from a **fixed set** of options (channel, approve/reject | `telegram` | Telegram | Create a bot via @BotFather. **Recommended (secure):** open the setup link/QR the CLI prints and paste the token there. **Non-secure fallback:** paste the token in chat instead and you pass it via `--telegram-bot-token`. Then tap **Start** on the bot in Telegram. | | `dashboard` | WhatsApp / MS Teams | **Keyless (`--keyless`, default):** sign in to the Novu dashboard and continue there (no CLI run). **Dashboard OAuth (omit `--keyless`):** CLI creates the agent, then opens the dashboard to finish channel setup. | -**If they pick `dashboard` and you are using keyless (`--keyless`, the default):** stop — do **not** run connect and do **not** generate an agent. Give the user the dashboard URL — **** (or if they asked for the EU region) — and tell them to **sign in (or sign up) and continue the onboarding from the dashboard**. Steps 2–5 do not apply. +**If they pick `dashboard` and you are using keyless (`--keyless`, the default):** **HARD STOP — never invoke `npx novu connect` in this branch (not in the foreground, not backgrounded, not with any channel flag).** Do **not** run connect and do **not** generate an agent. Give the user the dashboard URL — **** (or if they asked for the EU region) — and tell them to **sign in (or sign up) and continue the onboarding from the dashboard**. Steps 2–5 do not apply. 
**If they pick `dashboard` and you are using dashboard OAuth (omit `--keyless`):** ask WhatsApp or MS Teams if unclear; use `--channel whatsapp` or `--channel teams` in Step 3. @@ -216,10 +216,12 @@ npx novu@latest connect "$NOVU_AGENT_DESCRIPTION" \ Always start the Connect command as a **background** Shell (`block_until_ms: 0`), then **Await** its shell id for the markers below. This applies to every auth mode and channel. **Never run it in the foreground** — the CLI blocks up to ~5 min per handoff stage and a foreground call will hit the host shell timeout. +**Backgrounding rule (non-negotiable):** use the tool's own background mechanism (`run_in_background: true` / `block_until_ms: 0`) and wait **only** by Await/BashOutput on the returned shell id. **Do NOT** append `&` to the command, **do NOT** `sleep`, and **do NOT** `cat`/`tail`/`grep` any `/tmp/*.log` or other file to inspect progress — the only source of progress is the Connect shell's own stdout polled via Await. + Then follow the path that matches your flags: - **If using dashboard OAuth (omitting `--keyless`):** **Await** `NOVU_CONNECT_AUTH_URL_FILE=` on the background shell id, **Read** that file for the auth URL, deliver the URL to the user, then **Await** channel handoff markers and success on the same shell id. -- **If channel is `slack`, `email`, or `telegram`:** **Await** on that shell id (e.g. `NOVU_CONNECT_SLACK_SETUP_URL=`, `NOVU_CONNECT_INBOUND_ADDRESS=`, etc.). **Await** until `✓ Your agent is live` or `✗`. Do not use Monitor, `tail -f`, `grep`, Read on log files. +- **If channel is `slack`, `email`, or `telegram`:** **Await** on that shell id (e.g. `NOVU_CONNECT_SLACK_SETUP_URL=`, `NOVU_CONNECT_INBOUND_ADDRESS=`, etc.). **Await** until `✓ Your agent is live` or `✗`. Do not use Monitor, `tail -f`, `grep`, `sleep`, `ps`, or Read on log files — poll the shell id and nothing else. 
- **If channel is `whatsapp` or `teams` (dashboard OAuth only):** **Await** auth URL, then dashboard agent URL or success on the same shell id. - **If channel is `skip` in keyless mode:** foreground Shell is allowed — the only exception to the background rule above. @@ -371,7 +373,7 @@ On success the CLI exits `0` and prints: Claim your agent: # keyless only ``` -**After leading with the CLI's result, give a 1–2 sentence recap** of what onboarding set up — consistent with the conclusion-first operating principle. Before the channel/next-step pointer, briefly explain what the connect run built so the result isn't a black box. Keep it to one or two sentences, in plain language, e.g.: +**Open the final report with the CLI's literal success line** — copy the `✓ Your agent is live.` line verbatim rather than paraphrasing it (e.g. "set up", "connected", "ready"); on failure, lead with the CLI's error instead. **After leading with the CLI's result, give a 1–2 sentence recap** of what onboarding set up — consistent with the conclusion-first operating principle. Before the channel/next-step pointer, briefly explain what the connect run built so the result isn't a black box. 
Keep it to one or two sentences, in plain language, e.g.: > _"Here's what Novu built from your description: a hosted AI agent — its system prompt, the right tools and skills, MCP servers for the services you named, and a connection to <channel> so it can message your users."_ diff --git a/packages/shared/package.json b/packages/shared/package.json index bdbbf8dc6ac..5eb811fc254 100644 --- a/packages/shared/package.json +++ b/packages/shared/package.json @@ -26,6 +26,7 @@ "types": "dist/cjs/index.d.ts", "files": [ "dist/", + "docs/agent-onboarding.md", "!**/*.spec.*", "!**/*.json", "CHANGELOG.md", @@ -57,7 +58,8 @@ "require": "./dist/cjs/utils/safe-outbound-http.js", "import": "./dist/esm/utils/safe-outbound-http.js", "types": "./dist/esm/utils/safe-outbound-http.d.ts" - } + }, + "./docs/agent-onboarding.md": "./docs/agent-onboarding.md" }, "dependencies": { "lru-cache": "^11.5.1" diff --git a/playground/nextjs/src/app/api/novu-agent/agent.ts b/playground/nextjs/src/app/api/novu-agent/agent.ts index 59c6dc0ec32..a0e1b17253b 100644 --- a/playground/nextjs/src/app/api/novu-agent/agent.ts +++ b/playground/nextjs/src/app/api/novu-agent/agent.ts @@ -29,10 +29,7 @@ function buildDemoCard(platform: string): CardElement { children: [ CardText('This card was posted with `thread.post(Card(...))` and normalized into an agent reply payload.'), Divider(), - Fields([ - Field({ label: 'Platform', value: platform }), - Field({ label: 'Source', value: 'chat-sdk' }), - ]), + Fields([Field({ label: 'Platform', value: platform }), Field({ label: 'Source', value: 'chat-sdk' })]), Section([CardText('Buttons below emit `onAction` callbacks back through the bridge.')]), Actions([ Button({ id: 'card-approve', label: 'Approve', style: 'primary', value: 'approved' }), diff --git a/playground/nextjs/src/app/novu-agent/page.tsx b/playground/nextjs/src/app/novu-agent/page.tsx index 1560e389aad..7c2831c89bf 100644 --- a/playground/nextjs/src/app/novu-agent/page.tsx +++ 
b/playground/nextjs/src/app/novu-agent/page.tsx @@ -57,12 +57,12 @@ export default function NovuAgentPlayground() {

Novu Chat-adapter playground

- Craft a signed AgentBridgeRequest and run it through the real @novu/chat-sdk-adapter adapter - locally. Reply POSTs are captured instead of sent to Novu — no credentials needed. + Craft a signed AgentBridgeRequest and run it through the real @novu/chat-sdk-adapter{' '} + adapter locally. Reply POSTs are captured instead of sent to Novu — no credentials needed.

- Tip: send the message card (or click Send a card reply) to have the agent post a - chat-sdk Card; whoami echoes the resolved subscriber. + Tip: send the message card (or click Send a card reply) to have the agent post a chat-sdk{' '} + Card; whoami echoes the resolved subscriber.

@@ -178,7 +178,13 @@ const secondaryButtonStyle: React.CSSProperties = { cursor: 'pointer', width: 'fit-content', }; -const labelStyle: React.CSSProperties = { display: 'flex', flexDirection: 'column', gap: 4, fontSize: 14, fontWeight: 600 }; +const labelStyle: React.CSSProperties = { + display: 'flex', + flexDirection: 'column', + gap: 4, + fontSize: 14, + fontWeight: 600, +}; const inputStyle: React.CSSProperties = { padding: '8px 10px', border: '1px solid #ccc', diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index dba4ce8b426..de7d19624ed 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -765,7 +765,7 @@ importers: version: 3.0.51(react@19.2.3)(zod@4.3.5) '@better-auth/sso': specifier: ^1.3.0 - version: 1.4.7(better-auth@1.5.6(e4100f2b29709d66937a4c2872ba4f79)) + version: 1.4.7(better-auth@1.5.6(94b681c845e05f58c362e19ad9fdae01)) '@calcom/embed-react': specifier: 1.5.2 version: 1.5.2(react-dom@19.2.3(react@19.2.3))(react@19.2.3) @@ -975,7 +975,7 @@ importers: version: 6.2.6(react-dom@19.2.3(react@19.2.3))(react@19.2.3) better-auth: specifier: 1.5.6 - version: 1.5.6(e4100f2b29709d66937a4c2872ba4f79) + version: 1.5.6(94b681c845e05f58c362e19ad9fdae01) class-variance-authority: specifier: ^0.7.0 version: 0.7.1 @@ -2132,7 +2132,7 @@ importers: dependencies: '@better-auth/sso': specifier: ^1.4.9 - version: 1.5.6(@better-auth/core@1.5.6(@better-auth/utils@0.3.1)(@better-fetch/fetch@1.1.21)(@opentelemetry/api@1.9.0)(better-call@1.3.2(zod@4.3.6))(jose@6.1.3)(kysely@0.28.17)(nanostores@1.2.0))(@better-auth/utils@0.3.1)(better-auth@1.5.6(e4100f2b29709d66937a4c2872ba4f79))(better-call@1.3.2(zod@4.3.6)) + version: 1.5.6(@better-auth/core@1.5.6(@better-auth/utils@0.3.1)(@better-fetch/fetch@1.1.21)(@opentelemetry/api@1.9.0)(better-call@1.3.2(zod@4.3.6))(jose@6.1.3)(kysely@0.28.17)(nanostores@1.2.0))(@better-auth/utils@0.3.1)(better-auth@1.5.6(94b681c845e05f58c362e19ad9fdae01))(better-call@1.3.2(zod@4.3.6)) '@clerk/backend': specifier: ^3.4.11 version: 
3.4.11(react-dom@19.2.3(react@19.2.3))(react@19.2.3) @@ -2171,7 +2171,7 @@ importers: version: link:../../../packages/stateless better-auth: specifier: 1.5.6 - version: 1.5.6(e4100f2b29709d66937a4c2872ba4f79) + version: 1.5.6(94b681c845e05f58c362e19ad9fdae01) better-call: specifier: ^1.3.2 version: 1.3.2(zod@4.3.6) @@ -2468,6 +2468,37 @@ importers: specifier: ^4.49.0 version: 4.68.1 + libs/agent-evals: + dependencies: + '@ai-sdk/anthropic': + specifier: ^3.0.10 + version: 3.0.13(zod@3.25.76) + '@novu/shared': + specifier: workspace:* + version: link:../../packages/shared + ai: + specifier: 6.0.50 + version: 6.0.50(zod@3.25.76) + dotenv: + specifier: ^16.6.1 + version: 16.6.1 + zod: + specifier: ^3.23.8 + version: 3.25.76 + devDependencies: + '@types/node': + specifier: ^22.0.0 + version: 22.15.13 + typescript: + specifier: 5.6.2 + version: 5.6.2 + vitest: + specifier: ^4.1.8 + version: 4.1.9(@edge-runtime/vm@3.0.3)(@opentelemetry/api@1.9.0)(@types/node@22.15.13)(happy-dom@20.8.9)(jsdom@20.0.3)(vite@6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.21.0)(yaml@2.8.3)) + vitest-evals: + specifier: 0.12.0 + version: 0.12.0(ai@6.0.50(zod@3.25.76))(tinyrainbow@3.1.0)(vitest@4.1.9(@edge-runtime/vm@3.0.3)(@opentelemetry/api@1.9.0)(@types/node@22.15.13)(happy-dom@20.8.9)(jsdom@20.0.3)(vite@6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.21.0)(yaml@2.8.3)))(zod@3.25.76) + libs/application-generic: dependencies: '@anthropic-ai/aws-sdk': @@ -15452,9 +15483,18 @@ packages: peerDependencies: vite: ^4.5.10 + '@vitest-evals/core@0.12.0': + resolution: {integrity: sha512-JOatlrVw4jcP9VCBAFcM07pGxUA2iLt4Ks5jaRYqyATjkNwPYnyNDL+YHgvelANfPA0BBX8MzRfs6vEkzJgC+A==} + + '@vitest-evals/report-ui@0.12.0': + resolution: {integrity: sha512-rjWKnB+WL1ekiIvHdcnEX0tfaCwfeG3BNU6jvGKuJsHqkf8JRtuTyy/xgUKKsb56CokcZ3K3hmeo6RKik/KBrQ==} + '@vitest/expect@4.1.7': resolution: {integrity: 
sha512-1R+tw0ortHEbZDGMymm+pN7/AFQ/RkFFdtd7EN+VBpynKmLbP8A3rpEXdshBJ7+8hQ9zBJh/i1s0yKNtxAnU7w==} + '@vitest/expect@4.1.9': + resolution: {integrity: sha512-vl/rYsUKcBr3SnQn166+XR5ZQcgMx3DQhFWdfli/cWpLnLUmbxZvyrJZotLFUryib+LtArYMSTJ5RbQ57ZqrlA==} + '@vitest/mocker@4.1.7': resolution: {integrity: sha512-vY7nuamKgfvpA1Koa3oYIw/k7D6kZnpGyNMZW8loow2bsBYla1TFdqTaXncWdRn4pgwNs+90RhnXhJScDwQeJA==} peerDependencies: @@ -15466,21 +15506,47 @@ packages: vite: optional: true + '@vitest/mocker@4.1.9': + resolution: {integrity: sha512-EVkXzBjrPGM+cK8/ANWgBrkUCfJfb38/EfTSO8h7pWvKkyPkpWxvR7BkD2MyItMF62C97zAEoqdpUixwR/e+Rw==} + peerDependencies: + msw: ^2.4.9 + vite: ^6.4.3 + peerDependenciesMeta: + msw: + optional: true + vite: + optional: true + '@vitest/pretty-format@4.1.7': resolution: {integrity: sha512-umgCarTOYQWIaDMvGDRZij+6b9oVeLIyJzfN+AS88e0ZOU3QTgNNSTtjQOpcvWr3np1N0j4WgZj+sb3oYBDscw==} + '@vitest/pretty-format@4.1.9': + resolution: {integrity: sha512-s0iufns3iIFitdgm+YR7g1whCAaGtXz459VS9/PqyKDEEFgYIhsHOQmXgIgDuYCt7DeQmiZT0Qe2OA2p4ZPu5A==} + '@vitest/runner@4.1.7': resolution: {integrity: sha512-BapjmAQ2aI78WdMEfeUWivnfVzB+VPGwWRQcJE0OUq7qEeEcBsCSf+0T5iREBNE5nBb4wA5Ya0W6IA+sghdEFw==} + '@vitest/runner@4.1.9': + resolution: {integrity: sha512-KXLMDtc7oe70+3mJfGrPUWPesswH+3sTxAMAMl8DG7I8IUQT4XW718dY5ID3vPUcmlu27CcKfY4P3h3I29SLJg==} + '@vitest/snapshot@4.1.7': resolution: {integrity: sha512-ZacLzja+TmJeZ1h14xW2FB/WpeimUD3haBXQPyJqxvo8jQTmfeA8zv58mtjN2C7EHXZDYVcVYdYmAxjkWVvKCw==} + '@vitest/snapshot@4.1.9': + resolution: {integrity: sha512-Jc7RKGNBo8Z28WYIm0Niej4xdSPByRf6mU58VpHQkd6Zh05rlnA+twjbK5HyeIGHxrzsc3mJgS43uM0CZKzaIA==} + '@vitest/spy@4.1.7': resolution: {integrity: sha512-kbkI5LMWakyuTIvs6fUJ5qdIVb1XVKsYJAT4OJ938cHMROYMSfmoQdZy0aaAnjbbc8F61vkoTqz/Az+/HiIu5Q==} + '@vitest/spy@4.1.9': + resolution: {integrity: sha512-fHpsS6mIi+PiEW+vcRVOMkX1oSaPKne3VOclSFICPcGOmfKgXPU5iAah+wcNcj2xPrCCmfq99IDGf+EojhhvhA==} + '@vitest/utils@4.1.7': resolution: {integrity: 
sha512-T532WBu791cBxJlCl6SO+J14l81DQx6uQHm1bQbmCDY7nqlEIgkza/UFnSBNaUtSf41unldDFjdOBYEQC4b5Hw==} + '@vitest/utils@4.1.9': + resolution: {integrity: sha512-A51o8ymO5PpqlWNnBP9ZHPXDIpuMtTLlGSjN7la4US+LJzoUMyhwjA5QXlm39JexgwHKW4Xjs8Z2d3dLCXOeuA==} + '@vonage/accounts@1.9.0': resolution: {integrity: sha512-4cW/tfYpL53uHR3YjTbLL/kn23/RllPmFkFf3LAhdvratwtnDSYiOy/nZooATjmon3fzdOYLW0kYGAvoeWlHUg==} @@ -27351,6 +27417,20 @@ packages: vite: optional: true + vitest-evals@0.12.0: + resolution: {integrity: sha512-pyVA4N8gM+T2JB+SGFNSuXcgf/CHbBygAXkXR1fEPEfleKyMacJXPF9gLWIyyC1x5BCrt0r4zkwzkdjZrdpwZQ==} + hasBin: true + peerDependencies: + ai: '>=4 <7' + tinyrainbow: '>=2 <4' + vitest: ^4.1.0 + zod: '>=3 <5' + peerDependenciesMeta: + ai: + optional: true + zod: + optional: true + vitest@4.1.7: resolution: {integrity: sha512-flYyaFd2CgoCoU+0UKt3pxksgC+S02iTDN0n3LtqaMeXsI9SBcdNujc2k0DeFLzUn/0k538yNjOSdwgCqcrwJA==} engines: {node: ^20.0.0 || ^22.0.0 || >=24.0.0} @@ -27392,6 +27472,47 @@ packages: jsdom: optional: true + vitest@4.1.9: + resolution: {integrity: sha512-nE3/LEyc0z87uHYLZebqCUOaJr2hdtuPp7BQ4BosVFnfltxgAvMG08NyrSGlPpOUWvR27c5flSmYFTNr78L9GQ==} + engines: {node: ^20.0.0 || ^22.0.0 || >=24.0.0} + hasBin: true + peerDependencies: + '@edge-runtime/vm': '*' + '@opentelemetry/api': ^1.9.0 + '@types/node': ^20.0.0 || ^22.0.0 || >=24.0.0 + '@vitest/browser-playwright': 4.1.9 + '@vitest/browser-preview': 4.1.9 + '@vitest/browser-webdriverio': 4.1.9 + '@vitest/coverage-istanbul': 4.1.9 + '@vitest/coverage-v8': 4.1.9 + '@vitest/ui': 4.1.9 + happy-dom: '*' + jsdom: '*' + vite: ^6.4.3 + peerDependenciesMeta: + '@edge-runtime/vm': + optional: true + '@opentelemetry/api': + optional: true + '@types/node': + optional: true + '@vitest/browser-playwright': + optional: true + '@vitest/browser-preview': + optional: true + '@vitest/browser-webdriverio': + optional: true + '@vitest/coverage-istanbul': + optional: true + '@vitest/coverage-v8': + optional: true + '@vitest/ui': + optional: true + 
happy-dom: + optional: true + jsdom: + optional: true + vlq@0.2.3: resolution: {integrity: sha512-DRibZL6DsNhIgYQ+wNdWDL2SL3bKPlVrRiBqV5yuMm++op8W4kGFtaQfCs4KEJn0wBZcHVHJ3eoywX8983k1ow==} @@ -27987,6 +28108,12 @@ snapshots: '@ai-sdk/provider-utils': 4.0.6(zod@3.25.20) zod: 3.25.20 + '@ai-sdk/anthropic@3.0.13(zod@3.25.76)': + dependencies: + '@ai-sdk/provider': 3.0.3 + '@ai-sdk/provider-utils': 4.0.6(zod@3.25.76) + zod: 3.25.76 + '@ai-sdk/gateway@3.0.14(zod@4.3.5)': dependencies: '@ai-sdk/provider': 3.0.3 @@ -28008,6 +28135,13 @@ snapshots: '@vercel/oidc': 3.1.0 zod: 3.25.20 + '@ai-sdk/gateway@3.0.23(zod@3.25.76)': + dependencies: + '@ai-sdk/provider': 3.0.5 + '@ai-sdk/provider-utils': 4.0.9(zod@3.25.76) + '@vercel/oidc': 3.1.0 + zod: 3.25.76 + '@ai-sdk/gateway@3.0.23(zod@4.3.5)': dependencies: '@ai-sdk/provider': 3.0.5 @@ -28057,6 +28191,13 @@ snapshots: eventsource-parser: 3.0.6 zod: 3.25.20 + '@ai-sdk/provider-utils@4.0.6(zod@3.25.76)': + dependencies: + '@ai-sdk/provider': 3.0.3 + '@standard-schema/spec': 1.1.0 + eventsource-parser: 3.0.6 + zod: 3.25.76 + '@ai-sdk/provider-utils@4.0.6(zod@4.3.5)': dependencies: '@ai-sdk/provider': 3.0.3 @@ -28071,6 +28212,13 @@ snapshots: eventsource-parser: 3.0.6 zod: 3.25.20 + '@ai-sdk/provider-utils@4.0.9(zod@3.25.76)': + dependencies: + '@ai-sdk/provider': 3.0.5 + '@standard-schema/spec': 1.1.0 + eventsource-parser: 3.0.6 + zod: 3.25.76 + '@ai-sdk/provider-utils@4.0.9(zod@4.3.5)': dependencies: '@ai-sdk/provider': 3.0.5 @@ -31849,21 +31997,21 @@ snapshots: '@better-auth/core': 1.5.6(@better-auth/utils@0.3.1)(@better-fetch/fetch@1.1.21)(@opentelemetry/api@1.9.0)(better-call@1.3.2(zod@4.3.6))(jose@6.1.3)(kysely@0.28.17)(nanostores@1.2.0) '@better-auth/utils': 0.3.1 - '@better-auth/sso@1.4.7(better-auth@1.5.6(e4100f2b29709d66937a4c2872ba4f79))': + '@better-auth/sso@1.4.7(better-auth@1.5.6(94b681c845e05f58c362e19ad9fdae01))': dependencies: '@better-fetch/fetch': 1.1.21 - better-auth: 1.5.6(e4100f2b29709d66937a4c2872ba4f79) + 
better-auth: 1.5.6(94b681c845e05f58c362e19ad9fdae01) fast-xml-parser: 5.7.3 jose: 6.1.3 samlify: 2.13.1 zod: 4.3.5 - '@better-auth/sso@1.5.6(@better-auth/core@1.5.6(@better-auth/utils@0.3.1)(@better-fetch/fetch@1.1.21)(@opentelemetry/api@1.9.0)(better-call@1.3.2(zod@4.3.6))(jose@6.1.3)(kysely@0.28.17)(nanostores@1.2.0))(@better-auth/utils@0.3.1)(better-auth@1.5.6(e4100f2b29709d66937a4c2872ba4f79))(better-call@1.3.2(zod@4.3.6))': + '@better-auth/sso@1.5.6(@better-auth/core@1.5.6(@better-auth/utils@0.3.1)(@better-fetch/fetch@1.1.21)(@opentelemetry/api@1.9.0)(better-call@1.3.2(zod@4.3.6))(jose@6.1.3)(kysely@0.28.17)(nanostores@1.2.0))(@better-auth/utils@0.3.1)(better-auth@1.5.6(94b681c845e05f58c362e19ad9fdae01))(better-call@1.3.2(zod@4.3.6))': dependencies: '@better-auth/core': 1.5.6(@better-auth/utils@0.3.1)(@better-fetch/fetch@1.1.21)(@opentelemetry/api@1.9.0)(better-call@1.3.2(zod@4.3.6))(jose@6.1.3)(kysely@0.28.17)(nanostores@1.2.0) '@better-auth/utils': 0.3.1 '@better-fetch/fetch': 1.1.21 - better-auth: 1.5.6(e4100f2b29709d66937a4c2872ba4f79) + better-auth: 1.5.6(94b681c845e05f58c362e19ad9fdae01) better-call: 1.3.2(zod@4.3.6) fast-xml-parser: 5.7.3 jose: 6.1.3 @@ -44382,6 +44530,14 @@ snapshots: transitivePeerDependencies: - supports-color + '@vitest-evals/core@0.12.0': + dependencies: + zod: 3.25.76 + + '@vitest-evals/report-ui@0.12.0': + dependencies: + '@vitest-evals/core': 0.12.0 + '@vitest/expect@4.1.7': dependencies: '@standard-schema/spec': 1.1.0 @@ -44391,6 +44547,15 @@ snapshots: chai: 6.2.2 tinyrainbow: 3.1.0 + '@vitest/expect@4.1.9': + dependencies: + '@standard-schema/spec': 1.1.0 + '@types/chai': 5.2.3 + '@vitest/spy': 4.1.9 + '@vitest/utils': 4.1.9 + chai: 6.2.2 + tinyrainbow: 3.1.0 + '@vitest/mocker@4.1.7(vite@6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.16.2)(yaml@2.8.3))': dependencies: '@vitest/spy': 4.1.7 @@ -44407,15 +44572,32 @@ snapshots: optionalDependencies: vite: 
6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.21.0)(yaml@2.8.3) + '@vitest/mocker@4.1.9(vite@6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.21.0)(yaml@2.8.3))': + dependencies: + '@vitest/spy': 4.1.9 + estree-walker: 3.0.3 + magic-string: 0.30.21 + optionalDependencies: + vite: 6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.21.0)(yaml@2.8.3) + '@vitest/pretty-format@4.1.7': dependencies: tinyrainbow: 3.1.0 + '@vitest/pretty-format@4.1.9': + dependencies: + tinyrainbow: 3.1.0 + '@vitest/runner@4.1.7': dependencies: '@vitest/utils': 4.1.7 pathe: 2.0.3 + '@vitest/runner@4.1.9': + dependencies: + '@vitest/utils': 4.1.9 + pathe: 2.0.3 + '@vitest/snapshot@4.1.7': dependencies: '@vitest/pretty-format': 4.1.7 @@ -44423,14 +44605,29 @@ snapshots: magic-string: 0.30.21 pathe: 2.0.3 + '@vitest/snapshot@4.1.9': + dependencies: + '@vitest/pretty-format': 4.1.9 + '@vitest/utils': 4.1.9 + magic-string: 0.30.21 + pathe: 2.0.3 + '@vitest/spy@4.1.7': {} + '@vitest/spy@4.1.9': {} + '@vitest/utils@4.1.7': dependencies: '@vitest/pretty-format': 4.1.7 convert-source-map: 2.0.0 tinyrainbow: 3.1.0 + '@vitest/utils@4.1.9': + dependencies: + '@vitest/pretty-format': 4.1.9 + convert-source-map: 2.0.0 + tinyrainbow: 3.1.0 + '@vonage/accounts@1.9.0(encoding@0.1.13)': dependencies: '@vonage/server-client': 1.9.0(encoding@0.1.13) @@ -44951,6 +45148,14 @@ snapshots: '@opentelemetry/api': 1.9.0 zod: 3.25.20 + ai@6.0.50(zod@3.25.76): + dependencies: + '@ai-sdk/gateway': 3.0.23(zod@3.25.76) + '@ai-sdk/provider': 3.0.5 + '@ai-sdk/provider-utils': 4.0.9(zod@3.25.76) + '@opentelemetry/api': 1.9.0 + zod: 3.25.76 + ai@6.0.50(zod@4.3.5): dependencies: '@ai-sdk/gateway': 3.0.23(zod@4.3.5) @@ -45718,7 +45923,7 @@ snapshots: jsonpointer: 5.0.1 leven: 3.1.0 - better-auth@1.5.6(e4100f2b29709d66937a4c2872ba4f79): + better-auth@1.5.6(94b681c845e05f58c362e19ad9fdae01): dependencies: '@better-auth/core': 
1.5.6(@better-auth/utils@0.3.1)(@better-fetch/fetch@1.1.21)(@opentelemetry/api@1.9.0)(better-call@1.3.2(zod@4.3.6))(jose@6.1.3)(kysely@0.28.17)(nanostores@1.2.0) '@better-auth/drizzle-adapter': 1.5.6(@better-auth/core@1.5.6(@better-auth/utils@0.3.1)(@better-fetch/fetch@1.1.21)(@opentelemetry/api@1.9.0)(better-call@1.3.2(zod@4.3.6))(jose@6.1.3)(kysely@0.28.17)(nanostores@1.2.0))(@better-auth/utils@0.3.1) @@ -45745,7 +45950,7 @@ snapshots: react-dom: 19.2.3(react@19.2.3) solid-js: 1.9.6 svelte: 5.55.7(@typescript-eslint/types@8.39.1) - vitest: 4.1.7(@edge-runtime/vm@3.0.3)(@opentelemetry/api@1.9.0)(@types/node@22.15.13)(happy-dom@20.8.9)(jsdom@20.0.3)(vite@6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.21.0)(yaml@2.8.3)) + vitest: 4.1.9(@edge-runtime/vm@3.0.3)(@opentelemetry/api@1.9.0)(@types/node@22.15.13)(happy-dom@20.8.9)(jsdom@20.0.3)(vite@6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.21.0)(yaml@2.8.3)) transitivePeerDependencies: - '@cloudflare/workers-types' - '@opentelemetry/api' @@ -59368,6 +59573,16 @@ snapshots: vite: 6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.21.0)(yaml@2.8.3) optional: true + vitest-evals@0.12.0(ai@6.0.50(zod@3.25.76))(tinyrainbow@3.1.0)(vitest@4.1.9(@edge-runtime/vm@3.0.3)(@opentelemetry/api@1.9.0)(@types/node@22.15.13)(happy-dom@20.8.9)(jsdom@20.0.3)(vite@6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.21.0)(yaml@2.8.3)))(zod@3.25.76): + dependencies: + '@vitest-evals/core': 0.12.0 + '@vitest-evals/report-ui': 0.12.0 + tinyrainbow: 3.1.0 + vitest: 4.1.9(@edge-runtime/vm@3.0.3)(@opentelemetry/api@1.9.0)(@types/node@22.15.13)(happy-dom@20.8.9)(jsdom@20.0.3)(vite@6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.21.0)(yaml@2.8.3)) + optionalDependencies: + ai: 6.0.50(zod@3.25.76) + zod: 3.25.76 + 
vitest@4.1.7(@edge-runtime/vm@3.0.3)(@opentelemetry/api@1.9.0)(@types/node@22.15.13)(happy-dom@20.8.9)(jsdom@20.0.3)(vite@6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.16.2)(yaml@2.8.3)): dependencies: '@vitest/expect': 4.1.7 @@ -59430,6 +59645,37 @@ snapshots: transitivePeerDependencies: - msw + vitest@4.1.9(@edge-runtime/vm@3.0.3)(@opentelemetry/api@1.9.0)(@types/node@22.15.13)(happy-dom@20.8.9)(jsdom@20.0.3)(vite@6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.21.0)(yaml@2.8.3)): + dependencies: + '@vitest/expect': 4.1.9 + '@vitest/mocker': 4.1.9(vite@6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.21.0)(yaml@2.8.3)) + '@vitest/pretty-format': 4.1.9 + '@vitest/runner': 4.1.9 + '@vitest/snapshot': 4.1.9 + '@vitest/spy': 4.1.9 + '@vitest/utils': 4.1.9 + es-module-lexer: 2.0.0 + expect-type: 1.3.0 + magic-string: 0.30.21 + obug: 2.1.1 + pathe: 2.0.3 + picomatch: 4.0.4 + std-env: 4.1.0 + tinybench: 2.9.0 + tinyexec: 1.0.2 + tinyglobby: 0.2.16 + tinyrainbow: 3.1.0 + vite: 6.4.3(@types/node@22.15.13)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.31.6)(tsx@4.21.0)(yaml@2.8.3) + why-is-node-running: 2.3.0 + optionalDependencies: + '@edge-runtime/vm': 3.0.3 + '@opentelemetry/api': 1.9.0 + '@types/node': 22.15.13 + happy-dom: 20.8.9 + jsdom: 20.0.3 + transitivePeerDependencies: + - msw + vlq@0.2.3: {} vscode-oniguruma@1.7.0: {}