
feat(agent-evals): add suite-based behavioral eval harness for agent onboarding fixes NV-8059 #11589

Merged
djabarovgeorge merged 22 commits into next from feat/agent-evals-harness
Jun 22, 2026

Conversation

@djabarovgeorge

@djabarovgeorge djabarovgeorge commented Jun 17, 2026

Contributor

Summary

  • Adds @novu/agent-evals — a suite-based behavioral eval harness that runs a real LLM agent against scripted scenarios with a mocked CLI, then grades playbook adherence via deterministic checks and LLM-as-judge graders.
  • First suite (agent-onboarding) covers 8 scenarios for packages/shared/docs/agent-onboarding.md (the npx novu connect flow).
  • Exports @novu/shared/docs/agent-onboarding.md so the harness resolves the canonical playbook via package export.
  • Adds CI workflow (path-triggered on playbook/harness changes) that runs the full eval suite including judge graders.

Why

Regression-test agent onboarding playbook behavior (connect flags, discipline, persona) before shipping doc or dashboard prompt changes.

Required CI secrets / env

Add in GitHub → Settings → Secrets and variables → Actions before merging:

| Variable | Type | Workflow | Required? | Notes |
| --- | --- | --- | --- | --- |
| ANTHROPIC_API_KEY | Secret | agent-evals.yml | Yes | Anthropic API key for agent + judge eval runs on PRs to next |

Judge graders always run when evals run (no NOVU_EVAL_JUDGE toggle). Optional overrides: NOVU_EVAL_MODEL, NOVU_EVAL_JUDGE_MODEL.
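
A minimal sketch of how the two optional overrides could be resolved. The helper name `resolveModels` and the fallback order (judge model falls back to the agent model, which falls back to a default) are assumptions for illustration; the default value `claude-sonnet-4-5` is the one the CodeRabbit walkthrough reports for judge.ts.

```typescript
// Hypothetical helper: the fallback order is an assumption, not the harness's confirmed logic.
type Env = Record<string, string | undefined>;

// Default reported for judge.ts in the review walkthrough; reusing it for the agent is an assumption.
const DEFAULT_MODEL = "claude-sonnet-4-5";

// Treat unset, empty, and whitespace-only values as "not provided".
function nonEmpty(value: string | undefined): string | undefined {
  const trimmed = value?.trim();
  return trimmed ? trimmed : undefined;
}

function resolveModels(env: Env): { agentModel: string; judgeModel: string } {
  const agentModel = nonEmpty(env.NOVU_EVAL_MODEL) ?? DEFAULT_MODEL;
  // The judge uses its own override when present, otherwise whatever the agent uses.
  const judgeModel = nonEmpty(env.NOVU_EVAL_JUDGE_MODEL) ?? agentModel;
  return { agentModel, judgeModel };
}
```

In a Node process this would typically be called as `resolveModels(process.env)`.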

Architecture

flowchart TB
  subgraph entry["Entry (vitest)"]
    Eval["onboarding.eval.ts\ndescribeEval per scenario"]
    Adapters["adapters.ts\ngrader → judge"]
  end

  subgraph core["Core simulation (src/core/)"]
    Harness["harness.ts\ncreateHarness + AI SDK loop"]
    Tools["tools.ts\nBash · BashOutput · AskUserQuestion · Read"]
    MockShell["mock-shell.ts\nTape replay engine"]
    Recorder["recorder.ts\nRunResult builder"]
    Graders["graders.ts\ndefineGraders · contains · judge"]
    Judge["judge.ts\nLLM-as-judge"]
  end

  subgraph suite["Suite (src/suites/agent-onboarding/)"]
    Scenarios["scenarios/{id}/\nscenario.ts · graders.ts · project/"]
    Parser["connect-parser.ts"]
    Tape["tape.ts"]
    Catalog["catalog.ts"]
  end

  Eval --> Harness
  Eval --> Adapters
  Adapters --> Graders
  Adapters --> Judge
  Harness --> Tools
  Tools --> MockShell
  Tools --> Recorder
  Harness --> Recorder
  Scenarios --> Eval
  Parser --> MockShell
  Tape --> MockShell

Test plan

  • pnpm --filter @novu/agent-evals check
  • pnpm --filter @novu/agent-evals test
  • pnpm --filter @novu/agent-evals eval with ANTHROPIC_API_KEY (runs deterministic + judge graders)
  • Single scenario: pnpm --filter @novu/agent-evals exec vitest run --config vitest.evals.config.ts -t keyless-slack-secure

Linear: https://linear.app/novu/issue/NV-8059/add-suite-based-behavioral-eval-harness-for-agent-onboarding-playbook

Greptile Summary

This PR introduces @novu/agent-evals, a suite-based behavioral eval harness that runs a real LLM agent against 8 scripted onboarding scenarios with a mocked CLI, then grades playbook adherence via deterministic checks and LLM-as-judge graders. It also exports @novu/shared/docs/agent-onboarding.md for canonical playbook resolution and adds a path-triggered CI workflow.

  • Core harness (libs/agent-evals/src/core/): MockShellEngine with tape-replay, shell-escape-aware export capture, fixture path containment via path.relative, and a single-pass quote/escape lexer in the watcher guard. Most correctness issues from prior review rounds have been addressed.
  • Agent-onboarding suite (src/suites/agent-onboarding/): 8 scenarios covering keyless, dashboard-OAuth, discipline, and persona paths. The shell-word tokenizer in connect-parser.ts now strips quotes and handles positional descriptions anywhere in the command; buildDefaultTape now enforces requireKeyless by default; three scenarios that were previously missing requireKeyless: true have been fixed — but discipline-no-timers still omits this guard.
  • Non-eval changes: Import reordering and line-length reformatting across apps/api, apps/dashboard, packages/chat-adapter, and the playground; no logic changes.
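
For readers unfamiliar with the path.relative containment check mentioned in the core-harness bullet, here is a minimal sketch of the technique. The function name `isInsideRoot` is hypothetical and this is not the code from tools.ts; it only shows why a relative-path test is safer than a string-prefix test.

```typescript
import * as path from "node:path";

// Hypothetical helper name; only the path.relative technique comes from the review text.
function isInsideRoot(projectRoot: string, candidate: string): boolean {
  const root = path.resolve(projectRoot);
  const target = path.resolve(root, candidate);
  const rel = path.relative(root, target);

  // An empty relative path means the target is the root itself.
  if (rel === "") return true;
  // An absolute result can occur when the paths are on different Windows drives.
  if (path.isAbsolute(rel)) return false;
  // Anything that climbs out of the root starts with a ".." segment.
  return rel !== ".." && !rel.startsWith(".." + path.sep);
}
```

A prefix test such as `target.startsWith(root)` would accept `/root/proj-evil/file` for the root `/root/proj`; the relative-path form rejects it because the result begins with a `..` segment.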

Confidence Score: 4/5

Safe to merge with one scenario correctness fix outstanding: discipline-no-timers accepts the wrong auth path.

Most correctness issues from prior review rounds have been addressed — shell-word tokenizer, export capture, path guard, watcher guard, tape keyless enforcement for three scenarios, and the QR host-aware check. One scenario (discipline-no-timers) was missed in the keyless enforcement sweep: its tape accepts a command without --keyless even though the user prompt has no dashboard login, so an agent can pass all four graders on the wrong auth path. The follow-up injection via toolResult.output is also a known unresolved gap, but its impact is isolated to the slack-in-chat-rerun scenario's in-chat token path. The non-eval code changes are pure formatting with no logic impact.

libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/scenario.ts needs requireKeyless: true. libs/agent-evals/src/suites/agent-onboarding/harness.ts — the toolResult.output vs .result discrepancy for follow-up injection remains open.
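
To illustrate why an omitted requireKeyless guard lets the wrong auth path through, here is a sketch of a tape validator. The shapes `ConnectFlags` and `TapeRules` and the function `validateConnect` are hypothetical stand-ins that borrow field names from the review text; the real connectValidate may differ.

```typescript
// Hypothetical shapes; field names mirror the review text but are assumptions.
interface ConnectFlags {
  keyless: boolean;
  channel?: string;
}

interface TapeRules {
  requireKeyless?: boolean;
  requireNoKeyless?: boolean;
  allowedChannels?: string[];
}

// Returns an error message when the command violates the scenario rules, or null when it is accepted.
function validateConnect(flags: ConnectFlags, rules: TapeRules): string | null {
  if (rules.requireKeyless && !flags.keyless) return "expected --keyless";
  if (rules.requireNoKeyless && flags.keyless) return "unexpected --keyless";
  if (rules.allowedChannels) {
    if (!flags.channel) return "missing --channel";
    if (!rules.allowedChannels.includes(flags.channel)) {
      return `channel ${flags.channel} is not allowed`;
    }
  }
  return null;
}
```

With no `requireKeyless` rule, `validateConnect({ keyless: false, channel: "slack" }, {})` returns `null`, so a non-keyless command is accepted. Adding `requireKeyless: true` turns the same command into a validation error.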

Important Files Changed

| Filename | Overview |
| --- | --- |
| libs/agent-evals/src/core/tools.ts | New file implementing mock harness tools (Bash, BashOutput, AskUserQuestion, Read). Includes shell-escape-aware export capture, fixture path containment guard using path.relative, and watcher command rejection. Read tool now records exactly once per call. |
| libs/agent-evals/src/core/recorder.ts | New RunRecorder class and shell utilities. stripShellStringLiterals lexer correctly handles the apostrophe idiom so watcher guard no longer false-fires on quoted agent descriptions. |
| libs/agent-evals/src/core/mock-shell.ts | New MockShellEngine implementing tape-replay with pendingWhen support. Shells correctly stay running until killed when pendingWhen returns true. |
| libs/agent-evals/src/suites/agent-onboarding/harness.ts | New scenario harness driving generateText multi-turn loop with follow-up injection. shouldInjectFollowUp checks toolResult.output which is not the AI SDK field (should be .result), causing followUpOnOptionId to never fire via that path. |
| libs/agent-evals/src/suites/agent-onboarding/connect-parser.ts | Shell-word tokenizer now strips quotes so --channel "slack" correctly yields slack. connectValidate now rejects commands missing --channel when allowedChannels is required. findConnectPositional handles descriptions anywhere in the command. |
| libs/agent-evals/src/suites/agent-onboarding/tape.ts | buildDefaultTape now defaults requireKeyless to true when requireNoKeyless is not set, fixing the keyless tape enforcement gap. |
| libs/agent-evals/src/suites/agent-onboarding/catalog.ts | Suite-wide grader catalog. qrHostAware now accepts both OS open and inline Markdown image paths. noConnectOnKeylessWhatsapp correctly fails when connect commands were tracked. |
| .github/workflows/agent-evals.yml | New CI workflow triggering on playbook and harness path changes. Runs full eval suite including judge graders on every PR that triggers it since ANTHROPIC_API_KEY is always exported. |
| libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/scenario.ts | Polling-discipline scenario that should enforce keyless mode (user prompt has no dashboard login) but is missing requireKeyless: true, allowing an agent to pass with the wrong auth path. |
| libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts | Slack in-chat rerun scenario. pendingWhen correctly keeps the first shell running until killed. requireNoKeyless is set because the user is signed in to the dashboard. |
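
The quote-stripping behavior described for connect-parser.ts can be sketched as follows. This is an illustrative tokenizer written for this note, not the code under review; it reuses the names `tokenizeShellWords` and `readFlagValue` from the flowchart, but their signatures here are assumptions, and backslash handling inside double quotes is simplified relative to a real shell.

```typescript
// Illustrative shell-word tokenizer: handles single quotes, double quotes, and backslash escapes.
// Simplification: inside double quotes a backslash escapes any following character.
function tokenizeShellWords(input: string): string[] {
  const words: string[] = [];
  let current = "";
  let inWord = false;
  let quote: string | null = null;

  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    if (quote === "'") {
      // Single quotes are literal until the closing quote.
      if (ch === "'") quote = null;
      else current += ch;
    } else if (quote === '"') {
      if (ch === '"') quote = null;
      else if (ch === "\\" && i + 1 < input.length) current += input.charAt(++i);
      else current += ch;
    } else if (ch === "'" || ch === '"') {
      quote = ch;
      inWord = true; // an empty quoted string still produces a word
    } else if (ch === "\\" && i + 1 < input.length) {
      current += input.charAt(++i);
      inWord = true;
    } else if (/\s/.test(ch)) {
      if (inWord) {
        words.push(current);
        current = "";
        inWord = false;
      }
    } else {
      current += ch;
      inWord = true;
    }
  }
  if (inWord) words.push(current);
  return words;
}

// Returns the word following `flag`, or undefined when the flag is absent or has no value.
function readFlagValue(words: string[], flag: string): string | undefined {
  const index = words.indexOf(flag);
  return index >= 0 && index + 1 < words.length ? words[index + 1] : undefined;
}
```

Because quotes are consumed by the lexer rather than kept in the token, `--channel "slack"` and `--channel slack` produce the same flag value, and a quoted description with spaces stays a single positional word.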

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[onboarding.eval.ts\ndescribeEval per scenario] --> B[scenarioHarness\ngenerateText loop]
    B --> C[createHarnessTools\nBash · BashOutput · AskUserQuestion · Read]
    C --> D{Command type}
    D -->|export VAR=...| E[captureLeadingExports\nstores to context.env]
    D -->|novu connect ...| F[MockShellEngine.createShell]
    D -->|kill/pkill| G[MockShellEngine.killShell]
    D -->|open/xdg-open| H[recorder.recordOpenedFile]
    F --> I[connectParser.parse\ntokenizeShellWords · readFlagValue\nfindConnectPositional]
    I --> J[connectValidate\nrequireKeyless · allowedChannels · --ci]
    J -->|valid| K[selectTapeChunks\npendingWhen check]
    J -->|invalid| L[exitCode=1 error chunk]
    K --> M[pollShell / BashOutput polling]
    M --> N[RunRecorder.build\nRunResult]
    N --> O[Graders\ndeterministic · judge]
    O --> P{pass / fail / skip}
    B -->|follow-up injection| Q[shouldInjectFollowUp\nfollowUpTextPattern OR\ntoolResult.output ⚠️]

Reviews (10): Last reviewed commit: "Merge remote-tracking branch 'origin/nex..." | Re-trigger Greptile

djabarovgeorge and others added 3 commits June 16, 2026 11:07
Introduce @novu/agent-evals with a mocked CLI runner, deterministic and LLM judge graders, and an agent-onboarding scenario suite, plus CI to run evals on doc changes and nightly.

Co-authored-by: Cursor <cursoragent@cursor.com>
…es NV-8059

Co-authored-by: Cursor <cursoragent@cursor.com>
@linear-code

linear-code Bot commented Jun 17, 2026


NV-8059

@netlify

netlify Bot commented Jun 17, 2026


Deploy Preview for dashboard-v2-novu-staging ready!

Name Link
🔨 Latest commit 707fcce
🔍 Latest deploy log https://app.netlify.com/projects/dashboard-v2-novu-staging/deploys/6a3917c64a50f900084f1922
😎 Deploy Preview https://deploy-preview-11589.dashboard-v2.novu-staging.co

To edit notification comments on pull requests, go to your Netlify project configuration.

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Introduces the @novu/agent-evals library: a vitest-evals–based behavioral evaluation harness for Novu coding-agent playbooks. It includes core types, mock shell engine with tape replay, run recorder, deterministic and LLM-as-judge graders, four AI SDK harness tools, eight agent-onboarding scenarios, scenario harness, adapters, CI workflow, Nx project config, and updated onboarding playbook doc.

Changes

Agent Evals Harness

Layer / File(s) Summary
Core data types and path utilities
libs/agent-evals/src/core/types.ts, libs/agent-evals/src/core/resolve-package-file.ts, libs/agent-evals/src/load-env.ts
Defines all shared TypeScript contracts: grader types (GraderResult, GraderOutcome, GraderFn, GraderDefinition), ToolCallRecord, Tape, EvalScenario, RunResult, MockShellState, CommandParser, RegisteredScenario, Suite; exports PACKAGE_ROOT, normalizePath, resolvePackageFile, and dotenv side-effect loader.
Mock shell engine and run recorder
libs/agent-evals/src/core/mock-shell.ts, libs/agent-evals/src/core/recorder.ts
MockShellEngine replays scripted tapes with conditional chunk selection via when() predicates, per-call stdout emission, and kill support. RunRecorder accumulates tool calls, messages, deduplicated URLs, opened files, and shell lifecycle events into RunResult snapshots. Includes command classifiers: extractUrls, isKillCommand, isOpenCommand, isForbiddenWatcherCommand, shellSummary.
Grader utilities and LLM judge
libs/agent-evals/src/core/graders.ts, libs/agent-evals/src/core/judge.ts
graders.ts provides fail, labeled, defineGraders, deterministic text/pattern graders, toolCallsNamed, transcriptText, and judge() factory. judge.ts implements runJudge via Anthropic generateText with env-driven model fallback (claude-sonnet-4-5 default) and YES/NO/UNKNOWN verdict parsing.
Harness tools: Bash, BashOutput, AskUserQuestion, Read
libs/agent-evals/src/core/tools.ts
createHarnessTools builds four AI SDK tools. Bash handles env capture (export VAR=...), PNG opens, shell kill, tape replay, URL extraction. BashOutput polls shells and reads sentinel files via regex patterns. AskUserQuestion matches scripted answers by regex/substring. Read enforces project-root path safety and blocks /tmp/ and .log paths.
Scenario harness, adapters, and eval entry point
libs/agent-evals/src/suites/agent-onboarding/harness.ts, libs/agent-evals/src/suites/agent-onboarding/adapters.ts, libs/agent-evals/src/suites/agent-onboarding/onboarding.eval.ts, libs/agent-evals/vitest.config.ts, libs/agent-evals/vitest.evals.config.ts
scenarioHarness runs multi-turn generateText with follow-up injection based on regex/option-id matching and returns RunResult plus token counts. adapters.ts bridges GraderDefinition to vitest-evals Judge via graderToJudge/gradersToJudges/isJudgeEnabled. onboarding.eval.ts registers scenarios via describeEval with judgeThreshold: 0.8. Vitest configs set file patterns, timeouts, concurrency (default 4), and reporters.
Connect parser, tape utilities, and onboarding catalog
libs/agent-evals/src/suites/agent-onboarding/connect-parser.ts, libs/agent-evals/src/suites/agent-onboarding/tape.ts, libs/agent-evals/src/suites/agent-onboarding/catalog.ts, packages/shared/package.json
connectParser parses ConnectFlags; connectValidate enforces keyless/channel/secretKey rules. connectTape/buildDefaultTape construct Tape objects with validation wired. catalog.ts defines 30+ deterministic grader functions (secret-key, polling, URL surfacing, description tokens, etc.) plus sharedJudgeGraders. @novu/shared exports agent-onboarding.md.
Suite kit barrel and agentOnboardingSuite
libs/agent-evals/src/suites/agent-onboarding/kit.ts, libs/agent-evals/src/suites/agent-onboarding/index.ts
kit.ts re-exports grader helpers, types, catalog, and tape utilities as stable scenario import surface. index.ts exports agentOnboardingSuite wiring connectParser, sentinel/follow-up patterns, onTrackedCommand hook, and all eight scenario+grader registrations.
Eight agent-onboarding scenarios with fixtures and graders
libs/agent-evals/src/suites/agent-onboarding/scenarios/*/scenario.ts, libs/agent-evals/src/suites/agent-onboarding/scenarios/*/graders.ts, libs/agent-evals/src/suites/agent-onboarding/scenarios/*/project/*
Each scenario: scenario.ts (metadata, projectRoot, scriptedAnswers, connectTape), graders.ts (labeled graders from catalog + sharedJudgeGraders), and project fixtures (README, package.json, optional auth URL). Covers keyless Slack, dashboard OAuth login, WhatsApp redirect, email handoff, Telegram QR, Slack in-chat rerun, persona infra exclusion, and polling discipline.
Grader unit test
libs/agent-evals/src/suites/agent-onboarding/graders.test.ts
Adds buildResult synthetic RunResult builder and averageScore helper. Tests assert keyless-whatsapp-redirect graders score 1 on passing run and below 1 on failing run.
Package config, Nx project, CI workflow, and scripts
libs/agent-evals/package.json, libs/agent-evals/project.json, libs/agent-evals/tsconfig.json, libs/agent-evals/scripts/run-evals.sh, .github/workflows/agent-evals.yml, libs/agent-evals/.env.example, libs/agent-evals/.gitignore
package.json reconfigures as ESM with eval/test scripts and vitest-evals deps. Nx project and tsconfig added. run-evals.sh wraps pnpm eval with --judge flag. GitHub Actions workflow triggers on PR path filter (packages/shared/docs/agent-onboarding.md), injects ANTHROPIC_API_KEY, sets NOVU_EVAL_JUDGE: 'true', 45-min timeout.
Onboarding playbook, README, and Cursor triage skill
packages/shared/docs/agent-onboarding.md, libs/agent-evals/README.md, .cursor/skills/triage-agent-eval-failures/SKILL.md, .cursor/skills/triage-agent-eval-failures/reference.md
agent-onboarding.md tightens Step 3 backgrounding rules (forbid timers/log watchers), adds keyless-dashboard HARD STOP, requires verbatim CLI success line in Step 5. README documents harness architecture, module layout, run commands, env vars, threshold semantics, and extension guide. Cursor skill adds structured triage workflow with five worked examples.
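
The YES/NO/UNKNOWN verdict parsing summarized for judge.ts in the table above could look roughly like this. Both function names are hypothetical, and the mapping of UNKNOWN to a skipped grader is an assumption based on the pass / fail / skip outcomes shown in the Greptile flowchart, not something the walkthrough states.

```typescript
type Verdict = "YES" | "NO" | "UNKNOWN";
type GraderStatus = "pass" | "fail" | "skip";

// Hypothetical parser: takes the first standalone YES/NO/UNKNOWN token in the judge's reply.
// Replies with no recognizable token are treated as UNKNOWN.
function parseVerdict(reply: string): Verdict {
  const match = reply.toUpperCase().match(/\b(YES|NO|UNKNOWN)\b/);
  return match ? (match[1] as Verdict) : "UNKNOWN";
}

// Assumed mapping: an inconclusive judge skips the grader instead of failing it.
function verdictToStatus(verdict: Verdict): GraderStatus {
  if (verdict === "YES") return "pass";
  if (verdict === "NO") return "fail";
  return "skip";
}
```

The word boundaries keep words such as "NOT" or "NOTHING" from being read as a NO verdict.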

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

Possibly related PRs

  • novuhq/novu#11506: The agent-onboarding eval suite validates Slack setup-link behavior via NOVU_CONNECT_SLACK_SETUP_URL that is wired by the retrieved PR's new Slack setup-link CLI/API endpoints.
  • novuhq/novu#11516: Both PRs update packages/shared/docs/agent-onboarding.md guidance that is referenced by dashboard Cursor deep-link initialization.
  • novuhq/novu#11566: Agent-onboarding eval suite graders directly validate novu connect authentication flows (keyless vs dashboard OAuth flag semantics) from the retrieved PR.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 13.85% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title follows Conventional Commits format with valid type and scope, includes a clear imperative description, and ends with the Linear ticket reference as required. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch


Comment @coderabbitai help to get the list of available commands and usage tips.

@socket-security

socket-security Bot commented Jun 17, 2026


Review the following changes in direct dependencies. Learn more about Socket for GitHub.

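| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | pg@8.22.0 | 99 | 100 | 100 | 91 | 100 |
| Added | vitest-evals@0.12.0 | 94 | 100 | 100 | 95 | 100 |
| Added | resend@4.8.0 | 99 | 100 | 100 | 100 | 100 |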

View full report

@djabarovgeorge djabarovgeorge marked this pull request as draft June 17, 2026 07:34

@cursor cursor Bot left a comment

Contributor


Stale comment

Comment thread libs/agent-evals/src/core/tools.ts Outdated
Comment thread libs/agent-evals/src/core/tools.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 15

🧹 Nitpick comments (4)
libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts (1)

29-29: 💤 Low value

Consider consistent boolean coercion style in predicates.

Line 29 uses !flags.slackConfigToken while lines 33 and 42 use Boolean(flags.slackConfigToken). For consistency and idiomatic TypeScript, consider using truthiness directly in all predicates:

       {
         stdout: 'NOVU_CONNECT_SLACK_AUTHORIZE_URL=https://slack.test/oauth/rerun-token',
-        when: (flags) => Boolean(flags.slackConfigToken),
+        when: (flags) => !!flags.slackConfigToken,
       },
       {
         stdout: [
           '✓ Your agent is live.',
           '  Agent: Slack Rerun Agent (slack-rerun-1)',
           '  → Check Slack — your agent just messaged you.',
           '  Dashboard: https://dashboard.novu.test/agents/slack-rerun-1',
         ].join('\n'),
-        when: (flags) => Boolean(flags.slackConfigToken),
+        when: (flags) => !!flags.slackConfigToken,
       },

Also applies to: 33-33, 42-42
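
For context on what these `when` predicates do, here is a minimal sketch of predicate-gated tape replay. The `Flags` and `TapeChunk` shapes, the `selectTapeChunks` signature, and the first and last chunks of `exampleTape` are assumptions for illustration; only the authorize-URL chunk is taken from the diff above.

```typescript
// Hypothetical shapes: a chunk without a `when` predicate is always replayed.
interface Flags {
  slackConfigToken?: string;
}

interface TapeChunk {
  stdout: string;
  when?: (flags: Flags) => boolean;
}

// Keeps the chunks whose predicate accepts the parsed flags, preserving tape order.
function selectTapeChunks(chunks: TapeChunk[], flags: Flags): string[] {
  return chunks
    .filter((chunk) => chunk.when === undefined || chunk.when(flags))
    .map((chunk) => chunk.stdout);
}

const exampleTape: TapeChunk[] = [
  // Illustrative chunk for the first run, before the user has supplied a token.
  { stdout: "waiting for Slack config token", when: (flags) => !flags.slackConfigToken },
  {
    stdout: "NOVU_CONNECT_SLACK_AUTHORIZE_URL=https://slack.test/oauth/rerun-token",
    when: (flags) => Boolean(flags.slackConfigToken),
  },
  // Illustrative unconditional chunk.
  { stdout: "done" },
];
```

Whether a predicate is written as `Boolean(x)` or `!!x` makes no difference to which chunks are selected; the nitpick is purely about consistency.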

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts`
at line 29, The predicate logic in the scenario conditions uses inconsistent
boolean coercion styles: line 29 uses negation with `!flags.slackConfigToken`,
while lines 33 and 42 use explicit Boolean coercion with
`Boolean(flags.slackConfigToken)`. Standardize all three predicate checks (at
lines 29, 33, and 42) to use consistent truthiness checking. Convert them all to
use direct boolean values without explicit Boolean() wrapping or negation
operators, relying on JavaScript's natural truthiness evaluation for cleaner,
more idiomatic TypeScript code.
libs/agent-evals/src/core/types.ts (1)

7-149: 🏗️ Heavy lift

Backend object contracts should use interface in this .ts module.

Most exported object-shaped contracts here are declared as type aliases. For this backend TypeScript code, please migrate object shapes to interface (keep union/function aliases as type).

♻️ Example pattern (apply consistently across the file)
-export type GraderOutcome = {
+export interface GraderOutcome {
   status: GraderResult;
   reason?: string;
-};
+}

-export type ToolCallRecord = {
+export interface ToolCallRecord {
   name: string;
   args: Record<string, unknown>;
   result?: unknown;
   timestamp: number;
-};
+}

As per coding guidelines, on the backend, use interface for type definitions in *.ts files.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/agent-evals/src/core/types.ts` around lines 7 - 149, Migrate
object-shaped type contracts in this file from type aliases to interface
declarations. Convert all object shapes including GraderOutcome,
GraderDefinition, ToolCallRecord, ParsedCommand, TapeChunk, Tape,
ScriptedAnswer, EvalScenario, RunResult, ScenarioScore, RunnerOptions,
MockShellState, CommandParser, RegisteredScenario, and Suite to use the
interface keyword instead of type. Keep the GraderFn as a type alias since it
defines a function signature. Preserve all generic parameters and property
definitions exactly as they are, only changing the declaration syntax from type
to interface.

Source: Coding guidelines

libs/agent-evals/src/core/run-agent.ts (1)

9-14: ⚡ Quick win

Use interface for backend options object shapes.

At Line 9, RunAgentOptions is an object type alias; prefer interface here to match backend conventions.

As per coding guidelines, "**/*.{ts,tsx}: On the backend: use interface for type definitions; on the frontend: use type for type definitions."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/agent-evals/src/core/run-agent.ts` around lines 9 - 14, The
RunAgentOptions definition at line 9 is using the `type` keyword but backend
code conventions require `interface` for type definitions of object shapes.
Convert the `export type RunAgentOptions<TParsed = ParsedCommand>` declaration
to use `interface` instead, maintaining the same generic parameter and all
properties (suite, scenario, model, maxSteps) with their existing types and
optional modifiers.

Source: Coding guidelines

libs/agent-evals/src/core/tools.ts (1)

17-25: ⚡ Quick win

Use interface for backend object type declarations.

At Line 17, HarnessContext is declared as a type object alias. Repository rules for backend TS favor interface for type definitions.

As per coding guidelines, "**/*.{ts,tsx}: On the backend: use interface for type definitions; on the frontend: use type for type definitions."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/agent-evals/src/core/tools.ts` around lines 17 - 25, The HarnessContext
type declaration at line 17 uses the `type` keyword instead of `interface`,
which violates backend coding guidelines that require `interface` for type
definitions. Convert the HarnessContext type alias to an interface by replacing
the `type HarnessContext<TParsed = ParsedCommand> = {` syntax with `interface
HarnessContext<TParsed = ParsedCommand> {` and ensure the closing brace
maintains the same structure.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/agent-evals.yml:
- Around line 28-29: The checkout step using actions/checkout@v4 is missing the
persist-credentials security setting. Add the `persist-credentials: false`
option to the checkout step to prevent the GitHub token from persisting in the
workspace and potentially leaking through artifacts or logs. This is a security
best practice to protect credentials from unintended exposure.
- Around line 48-54: The direct expansion of ${{ inputs.enable_judge }} in the
run script creates a script injection vulnerability. Move inputs.enable_judge to
the environment block by adding an env section at the same level as the run
section, setting a variable like ENABLE_JUDGE to ${{ inputs.enable_judge }}.
Then in the conditional check, replace the direct expansion with a reference to
the environment variable instead of ${{ inputs.enable_judge }}, so the value is
safely isolated from shell execution.
- Line 29: The actions/checkout action is referenced using a version tag (`@v4`)
which creates a supply chain security risk. Replace the `@v4` reference with a
specific 40-character commit SHA to pin the action to an immutable version,
preventing unauthorized updates by the action owner.

In `@libs/agent-evals/package.json`:
- Around line 16-27: The dependencies in the package.json file violate the
minimumReleaseAge policy. Update the version constraints for ai, typescript,
`@ai-sdk/anthropic`, and `@types/node` to use stable releases that meet the age
requirement instead of the recent versions currently specified. Check the policy
documentation to determine the exact minimum age threshold, then replace each
problematic dependency version with an older stable release that was published
at least that many days ago. Verify all four packages (ai, typescript,
`@ai-sdk/anthropic`, and `@types/node`) comply with the policy before finalizing the
changes.

In `@libs/agent-evals/README.md`:
- Around line 141-162: The fenced code block in the README.md file is missing a
language label on the opening fence, which triggers the MD040 markdown linting
error. Add a language identifier like `text` to the opening triple backticks of
the code block that displays the directory structure to satisfy markdown linting
requirements.

In `@libs/agent-evals/src/core/graders.ts`:
- Around line 78-87: The definition.run(result) call in the grading loop lacks
error handling, causing any unhandled exception from a single grader to abort
the entire grading process and lose outcomes for all remaining graders. Wrap the
definition.run(result) invocation and the toOutcome conversion in a try-catch
block, capturing any thrown errors. In the catch block, create an appropriate
failed outcome (with status set to something like 'error' or 'failed') and
assign it to outcomes[name], then still invoke the onGraderResult callback so
the error is properly reported, and continue to the next grader instead of
letting the exception propagate up.

In `@libs/agent-evals/src/core/mock-shell.ts`:
- Around line 39-57: The parser.parse method call on line 41 can throw an error
that escapes the execute method and crashes the scenario. Wrap the
this.parser.parse(command, env) call in a try-catch block to catch any thrown
errors and handle them gracefully by setting chunks to an error message and
exitCode to 1, similar to how validationError is handled. Additionally, replace
the implicit truthiness checks for the parsed variable in the conditional
statements (the conditions checking isTracked && parsed and isTracked &&
!this.scenario.tape) with explicit null checks using parsed !== null to ensure
proper handling of the parsed value.

In `@libs/agent-evals/src/core/recorder.ts`:
- Around line 87-89: The isKillCommand function's regex pattern matches the
kill-related keywords anywhere in the command string, causing false positives
like 'echo kill' to be misclassified as kill commands. Modify the regex pattern
in isKillCommand to anchor it to the beginning of the string so it only matches
when kill, pkill, or killall is the actual command being invoked, not a
substring elsewhere in the command. Use a pattern that starts with a
beginning-of-string anchor and optionally matches leading whitespace before the
command keywords.

In `@libs/agent-evals/src/core/tools.ts`:
- Around line 59-65: The regex pattern at line 59 that matches export statements
only captures the export prefix of composite bash commands like `export X='1' &&
npx novu connect`. When a match is found, the function returns true immediately
without processing any tail commands that follow the `&&` operator, causing
composite commands to be treated as no-ops. To fix this, after matching the
export pattern and setting the environment variable, check if the original
command contains a `&&` operator followed by additional commands. If it does,
extract and return the tail command (the part after `&&`) instead of returning
true, so that the remaining command can be properly processed in subsequent
iterations or execution steps.
- Around line 176-183: The sentinel file reading at line 181 uses fs.readFile
directly to read the matched path without any safety checks, which can allow
access to files outside the workspace fixtures. Replace the
fs.readFile(match[1], 'utf8') call with readFixtureFile(...) function to ensure
the file path is validated and restricted to the intended fixture directory
before reading, maintaining workspace safety boundaries when extracting URLs
from the sentinel file contents.
- Around line 161-168: The recordPoll method is being called before validating
that the shellId is valid, allowing invalid shell IDs to mutate the run state.
Move the context.recorder.recordPoll(shellId) call to execute only after the
shell validation check (after the if (!shell) guard clause that returns early),
ensuring that polling is only recorded for valid shell IDs.
- Around line 51-53: The path containment check at line 51 using
`absolutePath.startsWith(projectRoot)` is insufficient for security because
string prefix matching does not account for path boundaries. A path like
`/root/proj-evil/file` would incorrectly pass the check if projectRoot is
`/root/proj`. Fix this by ensuring proper path boundary validation, such as
verifying that projectRoot ends with a path separator before checking
containment, or by using path normalization and comparison that respects
directory boundaries instead of simple string prefix matching.
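
One way to make the containment check boundary-aware is to compare via `path.relative` rather than a string prefix. This is a sketch under assumptions: `isInsideRoot` is a hypothetical helper name, not a function from the harness, and the example paths are POSIX-style.

```typescript
import * as path from 'node:path';

// Hypothetical helper: returns true only when `candidate` resolves to
// `projectRoot` itself or to a path underneath it.
function isInsideRoot(projectRoot: string, candidate: string): boolean {
  const root = path.resolve(projectRoot);
  const target = path.resolve(root, candidate);
  const relative = path.relative(root, target);

  // Escapes show up as a leading ".." segment (or an absolute path when the
  // two locations share no common root, e.g. different Windows drives).
  const escapes =
    relative === '..' || relative.startsWith(`..${path.sep}`) || path.isAbsolute(relative);

  return !escapes;
}

console.log(isInsideRoot('/root/proj', 'src/index.ts')); // true
console.log(isInsideRoot('/root/proj', '/root/proj-evil/file')); // false
```

Unlike `startsWith(projectRoot)`, this treats `/root/proj-evil` as a sibling directory rather than a child of `/root/proj`.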

In `@libs/agent-evals/src/core/types.ts`:
- Around line 155-156: The normalizePath function processes replacements in the
wrong order, causing Windows-style paths with `.\` prefixes to not be properly
normalized. The current implementation attempts to remove the leading `./`
before converting backslashes to forward slashes, so a path like `.\foo\bar`
keeps its `./` prefix after the backslashes are converted. Swap the order of the
two replace operations in the normalizePath function so that
backslash-to-forward-slash conversion happens first, then remove the leading
`./` prefix.
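
The ordering issue can be illustrated with a small sketch. Both function bodies are assumptions reconstructed from the description above, not copies of the code in `types.ts`; `normalizePathBuggy` is a hypothetical name used only for contrast.

```typescript
// Order described as buggy: the "./" strip runs while the path still
// contains backslashes, so ".\" is never recognised as a leading "./".
function normalizePathBuggy(value: string): string {
  return value.replace(/^\.\//, '').replace(/\\/g, '/');
}

// Suggested order: convert backslashes first, then strip the leading "./".
function normalizePath(value: string): string {
  return value.replace(/\\/g, '/').replace(/^\.\//, '');
}

console.log(normalizePathBuggy('.\\foo\\bar')); // "./foo/bar"
console.log(normalizePath('.\\foo\\bar')); // "foo/bar"
```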

In `@libs/agent-evals/src/index.ts`:
- Around line 20-22: The parseArgs function has two issues: first, when parsing
flag values like --suite, --judge, and others (at lines 20-22, 26-28, 32-34,
48-50, 69-71), it accesses argv[index + 1] without checking if the value exists
or if it's another flag starting with '-'. Second, the --fail-under flag parsing
(at lines 116-121) accepts NaN values which bypass the numeric comparison gate
because NaN comparisons always return false. Fix this by adding bounds checking
before accessing argv[index + 1] and validating that the next argument is not a
flag, and for --fail-under specifically, validate that the parsed number is not
NaN using Number.isNaN() and reject invalid input appropriately.
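
A sketch of the two guards described above. `readFlagValue` and `parseFailUnder` are hypothetical helper names, and the error messages are illustrative; the real `parseArgs` may structure this differently. The empty-string check is an extra assumption, added because `Number('')` evaluates to `0` rather than `NaN`.

```typescript
// Hypothetical helper: returns the value following argv[index], rejecting a
// missing value or a value that looks like another flag.
function readFlagValue(argv: readonly string[], index: number): string {
  const flag = argv[index];
  const value = argv[index + 1];

  if (value === undefined || value.startsWith('-')) {
    throw new Error(`Missing value for ${flag}`);
  }

  return value;
}

// Hypothetical helper: rejects non-numeric input so NaN can never reach the
// threshold comparison.
function parseFailUnder(raw: string): number {
  const parsed = Number(raw);

  if (raw.trim() === '' || Number.isNaN(parsed)) {
    throw new Error(`--fail-under expects a number, received "${raw}"`);
  }

  return parsed;
}

console.log(readFlagValue(['--suite', 'agent-onboarding'], 0)); // "agent-onboarding"
console.log(parseFailUnder('0.8')); // 0.8
```

One trade-off of the leading-dash check is that a legitimately negative value would also be rejected, which seems acceptable for a score threshold.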

In
`@libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/scenario.ts`:
- Line 6: The scenario currently uses a single qrPath variable pointing to
telegram-setup-qr.png for both NOVU_CONNECT_TELEGRAM_SETUP_QR_PNG and
NOVU_CONNECT_TELEGRAM_DEEPLINK_QR_PNG environment variables. Since these
represent distinct QR codes with different handlers in production, create a
separate deeplink QR fixture file named telegram-deeplink-qr.png in the project
fixture directory, then create a separate variable for the deeplink QR path and
use it to set NOVU_CONNECT_TELEGRAM_DEEPLINK_QR_PNG instead of reusing qrPath.

---

Nitpick comments:
In `@libs/agent-evals/src/core/run-agent.ts`:
- Around line 9-14: The RunAgentOptions definition at line 9 is using the `type`
keyword but backend code conventions require `interface` for type definitions of
object shapes. Convert the `export type RunAgentOptions<TParsed =
ParsedCommand>` declaration to use `interface` instead, maintaining the same
generic parameter and all properties (suite, scenario, model, maxSteps) with
their existing types and optional modifiers.

In `@libs/agent-evals/src/core/tools.ts`:
- Around line 17-25: The HarnessContext type declaration at line 17 uses the
`type` keyword instead of `interface`, which violates backend coding guidelines
that require `interface` for type definitions. Convert the HarnessContext type
alias to an interface by replacing the `type HarnessContext<TParsed =
ParsedCommand> = {` syntax with `interface HarnessContext<TParsed =
ParsedCommand> {` and ensure the closing brace maintains the same structure.

In `@libs/agent-evals/src/core/types.ts`:
- Around line 7-149: Migrate object-shaped type contracts in this file from type
aliases to interface declarations. Convert all object shapes including
GraderOutcome, GraderDefinition, ToolCallRecord, ParsedCommand, TapeChunk, Tape,
ScriptedAnswer, EvalScenario, RunResult, ScenarioScore, RunnerOptions,
MockShellState, CommandParser, RegisteredScenario, and Suite to use the
interface keyword instead of type. Keep the GraderFn as a type alias since it
defines a function signature. Preserve all generic parameters and property
definitions exactly as they are, only changing the declaration syntax from type
to interface.

In
`@libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts`:
- Line 29: The predicates in this scenario mix boolean coercion styles: line 29
negates with `!flags.slackConfigToken`, while lines 33 and 42 wrap the value in
`Boolean(flags.slackConfigToken)`. Standardize all three checks (lines 29, 33,
and 42) on one style. Prefer plain truthiness (`flags.slackConfigToken`) over
explicit Boolean() wrapping on lines 33 and 42, and keep the negation on line
29, since dropping it would invert that predicate.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4e426a29-a422-465c-ab13-f7710c385864

📥 Commits

Reviewing files that changed from the base of the PR and between 9338eee and d3d61fd.

⛔ Files ignored due to path filters (2)
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/telegram-setup-qr.png is excluded by !**/*.png
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (62)
  • .github/workflows/agent-evals.yml
  • libs/agent-evals/.env.example
  • libs/agent-evals/.gitignore
  • libs/agent-evals/README.md
  • libs/agent-evals/package.json
  • libs/agent-evals/project.json
  • libs/agent-evals/scripts/run-evals.sh
  • libs/agent-evals/src/core/graders.ts
  • libs/agent-evals/src/core/judge.ts
  • libs/agent-evals/src/core/mock-shell.ts
  • libs/agent-evals/src/core/recorder.ts
  • libs/agent-evals/src/core/reporters.ts
  • libs/agent-evals/src/core/resolve-package-file.ts
  • libs/agent-evals/src/core/run-agent.ts
  • libs/agent-evals/src/core/runner.ts
  • libs/agent-evals/src/core/tools.ts
  • libs/agent-evals/src/core/types.ts
  • libs/agent-evals/src/index.ts
  • libs/agent-evals/src/load-env.ts
  • libs/agent-evals/src/self-test.ts
  • libs/agent-evals/src/suites/agent-onboarding/catalog.ts
  • libs/agent-evals/src/suites/agent-onboarding/connect-parser.ts
  • libs/agent-evals/src/suites/agent-onboarding/index.ts
  • libs/agent-evals/src/suites/agent-onboarding/kit.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/README.md
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/novu-connect-auth-url.txt
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/package.json
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/project/README.md
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/project/package.json
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/project/README.md
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/project/package.json
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/project/README.md
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/project/package.json
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/project/README.md
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/project/package.json
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/project/README.md
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/project/package.json
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/README.md
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/novu-connect-auth-url.txt
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/package.json
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/README.md
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/package.json
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/tape.ts
  • libs/agent-evals/src/suites/registry.ts
  • libs/agent-evals/tsconfig.json
  • packages/shared/package.json

Comment thread .github/workflows/agent-evals.yml Outdated
Comment thread .github/workflows/agent-evals.yml Outdated
Comment thread .github/workflows/agent-evals.yml Outdated
Comment thread libs/agent-evals/package.json
Comment thread libs/agent-evals/README.md Outdated
Comment thread libs/agent-evals/src/core/tools.ts
Comment thread libs/agent-evals/src/core/tools.ts
Comment thread libs/agent-evals/src/core/types.ts Outdated
Comment thread libs/agent-evals/src/index.ts Outdated
Comment thread .github/workflows/agent-evals.yml Outdated
Comment thread libs/agent-evals/src/suites/agent-onboarding/connect-parser.ts Outdated
Comment thread libs/agent-evals/src/suites/agent-onboarding/connect-parser.ts Outdated
Comment thread libs/agent-evals/src/core/tools.ts Outdated
Comment thread libs/agent-evals/src/core/tools.ts
Comment thread libs/agent-evals/scripts/run-evals.sh Outdated
Comment thread libs/agent-evals/src/suites/agent-onboarding/connect-parser.ts Outdated
Comment thread libs/agent-evals/src/core/runner.ts Outdated
djabarovgeorge and others added 3 commits June 17, 2026 15:14
…ing system

- Updated the agent-evals harness to utilize vitest for running evaluations.
- Introduced new environment variables for LLM judge configuration.
- Removed legacy CLI entry point and refactored grading logic to improve clarity and maintainability.
- Enhanced grader definitions with human-readable labels for better reporting.
- Updated workflows to reflect changes in evaluation execution.

Co-authored-by: Cursor <cursoragent@cursor.com>
…RL extraction

- Updated the connect command to utilize dashboard OAuth by omitting the `--keyless` flag.
- Enhanced URL extraction functionality to include mailto links.
- Improved grading logic to ensure proper validation of dashboard OAuth usage.
- Refactored scenarios and documentation to reflect changes in onboarding requirements and best practices.

Co-authored-by: Cursor <cursoragent@cursor.com>
…e triage and shell command handling

- Added a section in the README for triaging failing scenarios using the `triage-agent-eval-failures` skill.
- Introduced a new function `readShellValue` to improve parsing of shell command values with proper handling of quotes and escapes.
- Updated `captureLeadingExports` to capture environment variables from shell commands more effectively.
- Modified grading logic in the `discipline-no-timers` scenario to count actual BashOutput poll calls for accurate evaluation.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread libs/agent-evals/src/suites/agent-onboarding/tape.ts
Comment thread libs/agent-evals/src/core/tools.ts
Comment thread libs/agent-evals/src/core/recorder.ts

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (3)
libs/agent-evals/src/core/tools.ts (2)

236-243: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Record polls only after shell ID validation.

Line 237 mutates poll state before the shellId existence check (Line 241), so invalid IDs still affect run metrics and grader outcomes.

🐛 Proposed fix
     execute: async ({ shellId }) => {
       context.recorder.recordToolCall('BashOutput', { shellId });
-      context.recorder.recordPoll(shellId);

       const shell = context.engine.pollShell(shellId);

       if (!shell) {
         return { error: `Unknown shell id: ${shellId}`, stdout: '', completed: true, exitCode: 1 };
       }
+      context.recorder.recordPoll(shellId);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/agent-evals/src/core/tools.ts` around lines 236 - 243, The recordPoll
method is being called on line 237 before the shell ID is validated on line 241,
causing invalid shell IDs to still record poll metrics. Move the
context.recorder.recordPoll(shellId) call to after the validation check (after
the if (!shell) block) so that polls are only recorded for valid, existing shell
IDs.

251-257: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Sentinel-file reads bypass fixture path safety checks.

Line 256 reads a captured path directly from shell output, bypassing fixture-root validation and allowing out-of-scope file reads.

🔒 Proposed fix
         if (match?.[1]) {
           try {
-            const fileContents = await fs.readFile(match[1], 'utf8');
+            const fileContents = await readFixtureFile(context.scenario.projectRoot, match[1]);

             for (const url of extractUrls(fileContents)) {
               context.recorder.recordUrl(url);
             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/agent-evals/src/core/tools.ts` around lines 251 - 257, The file path
captured from shell output (match[1]) in the loop over
context.suite.sentinelFilePatterns is being read directly without validation
against the fixture root directory. Validate that the matched file path
(match[1]) is within the fixture root scope before calling fs.readFile. Resolve
the file path against a fixture-root reference and ensure the resolved path
remains within the fixture-root boundaries to prevent out-of-scope file access.
libs/agent-evals/src/core/mock-shell.ts (1)

32-53: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden tracked-command parsing to avoid scenario crashes.

Line 36 can throw from parser.parse(...), which currently escapes createShell and can abort the whole eval run. Also, Line 41 should use parsed !== null (not truthiness).

🐛 Proposed fix
   createShell(command: string, runInBackground: boolean, env: Record<string, string>): MockShellState<TParsed> {
     this.shellCounter += 1;
     const id = `shell-${this.shellCounter}`;
     const isTracked = this.parser.matches(command);
-    const parsed = isTracked ? this.parser.parse(command, env) : null;
+    let parsed: TParsed | null = null;
+    let parseError: string | null = null;
+
+    if (isTracked) {
+      try {
+        parsed = this.parser.parse(command, env);
+      } catch (error) {
+        parseError = error instanceof Error ? error.message : String(error);
+      }
+    }

     let chunks: string[] = [];
     let exitCode: number | null = null;

-    if (isTracked && parsed && this.scenario.tape) {
+    if (isTracked && parseError) {
+      chunks = [`✗ Failed to parse tracked command: ${parseError}`];
+      exitCode = 1;
+    } else if (isTracked && parsed !== null && this.scenario.tape) {
       const validationError = this.scenario.tape.validate?.(parsed) ?? null;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/agent-evals/src/core/mock-shell.ts` around lines 32 - 53, The
createShell method has two issues that can cause crashes or incorrect logic:
First, the parser.parse() call on line 36 can throw an exception that escapes
the method and terminates the evaluation run, so wrap this call in a try-catch
block to handle parsing errors gracefully (treat parsing failures as untracked
commands or set appropriate error state). Second, on line 41 the condition
checks truthiness of parsed but should use explicit null comparison with parsed
!== null instead, since parsed can be null or an object value that needs proper
type checking.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@libs/agent-evals/package.json`:
- Around line 8-10: The dependency `@ai-sdk/anthropic@3.0.10` in your package.json
violates the minimum release age policy as it was published on June 16, 2026
(less than 3 days old). Locate the `@ai-sdk/anthropic` package in the dependencies
or devDependencies section of package.json and either downgrade it to a version
that was published before June 15, 2026, or remove it entirely until a version
that meets the 3-day minimum release age requirement becomes available.

In `@libs/agent-evals/src/suites/agent-onboarding/graders.test.ts`:
- Line 30: The judge.assess() call in the test is using an unsafe type assertion
`as never` to bypass type checking. Instead of casting to `never`, construct a
proper JudgeContext object with all required fields (input, output, metadata,
session, toolCalls, and harness) that the assess method expects. Remove the `as
never` cast and either provide complete context values or refactor the test to
properly satisfy the JudgeContext type requirements.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f1589afb-b40c-4388-9c13-a3175518d094

📥 Commits

Reviewing files that changed from the base of the PR and between d3d61fd and fa6efbd.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (41)
  • .cursor/skills/triage-agent-eval-failures/SKILL.md
  • .cursor/skills/triage-agent-eval-failures/reference.md
  • .github/workflows/agent-evals.yml
  • libs/agent-evals/.env.example
  • libs/agent-evals/.gitignore
  • libs/agent-evals/README.md
  • libs/agent-evals/package.json
  • libs/agent-evals/project.json
  • libs/agent-evals/scripts/run-evals.sh
  • libs/agent-evals/src/core/graders.ts
  • libs/agent-evals/src/core/judge.ts
  • libs/agent-evals/src/core/mock-shell.ts
  • libs/agent-evals/src/core/recorder.ts
  • libs/agent-evals/src/core/tools.ts
  • libs/agent-evals/src/core/types.ts
  • libs/agent-evals/src/suites/agent-onboarding/adapters.ts
  • libs/agent-evals/src/suites/agent-onboarding/catalog.ts
  • libs/agent-evals/src/suites/agent-onboarding/connect-parser.ts
  • libs/agent-evals/src/suites/agent-onboarding/graders.test.ts
  • libs/agent-evals/src/suites/agent-onboarding/harness.ts
  • libs/agent-evals/src/suites/agent-onboarding/kit.ts
  • libs/agent-evals/src/suites/agent-onboarding/onboarding.eval.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/tape.ts
  • libs/agent-evals/vitest.config.ts
  • libs/agent-evals/vitest.evals.config.ts
  • packages/shared/docs/agent-onboarding.md
💤 Files with no reviewable changes (5)
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/scenario.ts
✅ Files skipped from review due to trivial changes (2)
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/graders.ts
  • libs/agent-evals/.gitignore
🚧 Files skipped from review as they are similar to previous changes (14)
  • libs/agent-evals/src/suites/agent-onboarding/kit.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/graders.ts
  • .github/workflows/agent-evals.yml
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/graders.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/tape.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/scenario.ts
  • libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/graders.ts
  • libs/agent-evals/src/core/recorder.ts
  • libs/agent-evals/src/core/types.ts
  • libs/agent-evals/src/suites/agent-onboarding/catalog.ts

Comment thread libs/agent-evals/package.json
Comment thread libs/agent-evals/src/suites/agent-onboarding/graders.test.ts
djabarovgeorge and others added 2 commits June 18, 2026 11:06
…tions

- Removed unnecessary triggers for push and schedule events.
- Updated the evaluation job to always enable the LLM judge and specified the source path for agent onboarding evaluations.
Address review feedback on the eval harness:
- tools: segment-safe fixture-root containment (path.relative), route
  sentinel-file reads through the same guard, treat unquoted ;/& as shell
  separators so one-line export+connect commands keep their residual, and
  only record polls for valid shell ids
- recorder: anchor kill-command detection to command-leading invocations and
  ignore quoted argument text in the watcher guard (no false "sleep" rejects)
- mock-shell: guard tracked-command parsing so a parser throw fails the shell
  instead of aborting the scenario
- types: normalize slashes before stripping a leading ./ (Windows .\ paths)
- workflow: pin actions to commit SHAs to satisfy workflow-security-lint
- docs: add language label to fenced block (markdownlint MD040)

Co-authored-by: Cursor <cursoragent@cursor.com>

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
libs/agent-evals/src/core/tools.ts (1)

327-345: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Duplicate tool call recording on successful reads.

recordToolCall('Read', ...) is invoked at line 327 (entry) and again at line 339 (after success with bytes metadata). This double-records every successful file read, which can skew graders that count or filter result.toolCalls (e.g., noTimersNoWatchers iterates tool calls to detect patterns).

Other tools (Bash, BashOutput, AskUserQuestion) record exactly once. Align Read with that pattern by recording once with all available metadata.

Proposed fix
   const Read = tool({
     description: 'Read a file from the project workspace.',
     inputSchema: z.object({
       file_path: z.string(),
     }),
     execute: async ({ file_path: filePath }) => {
-      context.recorder.recordToolCall('Read', { file_path: filePath });
-
       if (filePath.includes('/tmp/') || filePath.endsWith('.log')) {
+        context.recorder.recordToolCall('Read', { file_path: filePath });
+
         return { error: 'Reading log files is discouraged in this flow.' };
       }
 
       if (filePath.endsWith('.png')) {
+        context.recorder.recordToolCall('Read', { file_path: filePath });
+
         return { content: '[PNG image omitted by harness]' };
       }
 
       try {
         const content = await readFixtureFile(context.scenario.projectRoot, filePath);
         context.recorder.recordToolCall('Read', { file_path: filePath }, { bytes: content.length });
 
         return { content };
       } catch (error) {
+        context.recorder.recordToolCall('Read', { file_path: filePath });
+
         return { error: error instanceof Error ? error.message : 'Failed to read file.' };
       }
     },
   });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/agent-evals/src/core/tools.ts` around lines 327 - 345, The Read tool
calls recordToolCall twice on every successful file read: once at the entry
point with only file_path, and again after the read succeeds with the bytes
metadata. The double recording skews graders that count or filter tool calls.
Remove the recordToolCall invocation at the entry of the function and record
exactly once per call instead: after readFixtureFile succeeds (with both
file_path and bytes), and once in each early-return or error branch so that
rejected and failed reads are still captured. This aligns the Read tool with
Bash, BashOutput, and AskUserQuestion, which each record exactly once.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8aff5019-cdff-4ff6-9d3e-6ea7650cef68

📥 Commits

Reviewing files that changed from the base of the PR and between 79cf8ff and 2ffc5a1.

📒 Files selected for processing (7)
  • .github/workflows/agent-evals.yml
  • libs/agent-evals/README.md
  • libs/agent-evals/src/core/mock-shell.ts
  • libs/agent-evals/src/core/recorder.ts
  • libs/agent-evals/src/core/tools.ts
  • libs/agent-evals/src/core/types.ts
  • playground/nextjs/.env.example
✅ Files skipped from review due to trivial changes (2)
  • playground/nextjs/.env.example
  • libs/agent-evals/README.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • .github/workflows/agent-evals.yml
  • libs/agent-evals/src/core/recorder.ts
  • libs/agent-evals/src/core/mock-shell.ts
  • libs/agent-evals/src/core/types.ts

Comment thread .github/workflows/agent-evals.yml
Trigger the eval job when the harness code (libs/agent-evals/**) or the
workflow itself changes, not only on the playbook doc, so grader/tape/parser/
mock-shell changes are covered. Intentionally omit the global lockfile to avoid
running this LLM-backed, secret-dependent job on every unrelated PR.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread libs/agent-evals/src/suites/agent-onboarding/catalog.ts Outdated
Comment thread .github/workflows/agent-evals.yml
…s fixes NV-8059

Address Greptile re-review:
- catalog: qrHostAware now passes when the agent embeds the QR PNG as an inline
  Markdown image (![..](*.png)) in chat, not only when it opens it via the OS
  viewer — both are playbook-approved host-aware delivery paths
- workflow: add nightly schedule + workflow_dispatch triggers and gate
  NOVU_EVAL_JUDGE to those events so PRs run deterministic graders only, matching
  the README/PR contract

Co-authored-by: Cursor <cursoragent@cursor.com>
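
As a rough illustration of the `qrHostAware` change in the commit above, the check can be thought of as accepting either delivery path. The function names, the regular expressions, and the list of viewer commands (`open`, `xdg-open`, `start`) are assumptions made for this sketch; the actual grader in `catalog.ts` may be implemented differently.

```typescript
// Hypothetical helper: true when chat text embeds a PNG as an inline
// Markdown image, e.g. ![QR](/tmp/qr.png) or ![QR](/tmp/qr.png "title").
function embedsPngInline(chatText: string): boolean {
  const inlinePng = /!\[[^\]]*\]\(\s*[^)\s]+\.png(?:\s+"[^"]*")?\s*\)/i;

  return inlinePng.test(chatText);
}

// Hypothetical helper: true when any shell command opens a PNG with an OS
// viewer. The viewer command list is an assumption for this sketch.
function opensPngInViewer(commands: string[]): boolean {
  const viewer = /^(?:open|xdg-open|start)\s+\S+\.png\b/i;

  return commands.some((command) => viewer.test(command.trim()));
}

// Either path counts as host-aware delivery in this sketch.
function qrHostAwareSketch(chatText: string, commands: string[]): boolean {
  return embedsPngInline(chatText) || opensPngInViewer(commands);
}
```

A plain Markdown link without the leading `!` is deliberately not accepted, since it does not render the image inline.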
Comment thread libs/agent-evals/src/core/recorder.ts Outdated
@djabarovgeorge djabarovgeorge marked this pull request as ready for review June 18, 2026 11:28
Replace the naive quoted-span regex with a single-pass lexer so the shell
'\'' apostrophe idiom (e.g. 'Bob'\''s sleep coach') no longer leaks words like
sleep/tail/grep to the watcher check and false-fails valid agent descriptions.
Unquoted command words are preserved, so real watcher commands are still caught.

Co-authored-by: Cursor <cursoragent@cursor.com>
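
The idea behind the lexer change can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code in `recorder.ts`: the function names, the three-state machine, and the watcher word list (taken from the `sleep`/`tail`/`grep` examples in the commit message) are all hypothetical.

```typescript
// Single-pass lexer sketch: returns only the unquoted, unescaped text of a
// shell command. Quoted spans and backslash-escaped characters are dropped.
function stripQuotedSpans(command: string): string {
  let out = '';
  let state: 'plain' | 'single' | 'double' = 'plain';

  for (let i = 0; i < command.length; i++) {
    const ch = command.charAt(i);

    if (state === 'single') {
      // Inside single quotes nothing is special except the closing quote.
      if (ch === "'") state = 'plain';
      continue;
    }

    if (state === 'double') {
      if (ch === '\\') i++; // skip the escaped character
      else if (ch === '"') state = 'plain';
      continue;
    }

    // Plain state: a backslash escapes the next character, so \' is data,
    // not the start of a new quoted span.
    if (ch === '\\') {
      i++;
      continue;
    }
    if (ch === "'") state = 'single';
    else if (ch === '"') state = 'double';
    else out += ch;
  }

  return out;
}

// Hypothetical watcher check over unquoted command words only.
function mentionsWatcher(command: string): boolean {
  const watchers = new Set(['sleep', 'tail', 'grep']);
  const words = stripQuotedSpans(command).split(/[\s;|&()]+/).filter(Boolean);

  return words.some((word) => watchers.has(word));
}
```

With input such as `--description 'Bob'\''s sleep coach'`, a regex that pairs up quotes naively would close the span after `Bob` and treat `s sleep coach` as unquoted text; tracking the escape in the plain state keeps the whole description out of the watcher check.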
Run the Cursor automation webhook after changes land on next via push, not while the PR is open.

Co-authored-by: Cursor <cursoragent@cursor.com>
Revert agent-onboarding-webhook.yml edits; that workflow and its Cursor secrets are outside the agent-evals harness scope.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread libs/agent-evals/src/core/tools.ts
djabarovgeorge and others added 3 commits June 21, 2026 22:53
Remove the NOVU_EVAL_JUDGE flag and its gating so judge graders run
alongside deterministic graders on every run, including CI. Drops the
flag from adapters, the eval suite, the workflow, env example, docs,
and the triage skill.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: George Djabarov <djabarovgeorge@users.noreply.github.com>
… and pending-shell modeling fixes NV-8059

Co-authored-by: George Djabarov <djabarovgeorge@users.noreply.github.com>
…s NV-8059

Co-authored-by: George Djabarov <djabarovgeorge@users.noreply.github.com>
Comment thread libs/agent-evals/src/suites/agent-onboarding/harness.ts
…V-8059

The conclusionFirstReport judge required the CLI result to be followed
directly by the next action, but the playbook mandates a 1-2 sentence
recap in between. This caused the grader to fail on every scenario.
Relax the prompt to allow the recap and fail only when the result is
buried under process narration or no next action is surfaced.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ixes NV-8059

The email-handoff, telegram-secure-qr, and persona-infra-exclusion scenarios
have no dashboard signal in their user prompt, so per the onboarding playbook
the agent must default to `--keyless`. Without `requireKeyless: true` the tape
also returns the success chunks for a dashboard-OAuth command, letting an agent
that omits `--keyless` pass every grader despite choosing the wrong auth mode.
Set `requireKeyless: true` so the tape rejects non-keyless commands, matching
the existing keyless-slack-secure scenario (via buildDefaultTape).

Co-authored-by: George Djabarov <djabarovgeorge@users.noreply.github.com>
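
To illustrate why the flag matters, here is a small sketch of a tape that rejects non-keyless commands. The `TapeOptions` and `TapeResult` shapes, the `replayConnect` function, the exit codes, and the output strings are hypothetical; only `requireKeyless`, `--keyless`, and `npx novu connect` come from the commit message, and the real tape engine in `mock-shell.ts` may behave differently.

```typescript
// Hypothetical result and option shapes for this sketch.
type TapeResult = { exitCode: number; stdout: string };
type TapeOptions = { requireKeyless: boolean };

// Replays a scripted response for `npx novu connect ...`.
function replayConnect(command: string, options: TapeOptions): TapeResult {
  const args = command.trim().split(/\s+/);
  const isConnect = args[0] === 'npx' && args[1] === 'novu' && args[2] === 'connect';

  if (!isConnect) {
    return { exitCode: 127, stdout: 'tape: unrecognized command' };
  }

  // With requireKeyless set, a command that omits --keyless is rejected
  // instead of receiving the success chunks.
  if (options.requireKeyless && !args.includes('--keyless')) {
    return { exitCode: 1, stdout: 'tape: expected --keyless for this scenario' };
  }

  return { exitCode: 0, stdout: 'connected' };
}
```

With `requireKeyless: false`, the same non-keyless command would succeed, which is how an agent choosing the wrong auth mode could previously pass every grader.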
…rness-pr-6c42

Co-authored-by: George Djabarov <djabarovgeorge@users.noreply.github.com>
@djabarovgeorge djabarovgeorge merged commit 0ad474a into next Jun 22, 2026
38 checks passed
@djabarovgeorge djabarovgeorge deleted the feat/agent-evals-harness branch June 22, 2026 11:24