feat(agent-evals): add suite-based behavioral eval harness for agent onboarding fixes NV-8059 #11589
Introduce @novu/agent-evals with a mocked CLI runner, deterministic and LLM judge graders, and an agent-onboarding scenario suite, plus CI to run evals on doc changes and nightly. Co-authored-by: Cursor <cursoragent@cursor.com>
…es NV-8059 Co-authored-by: Cursor <cursoragent@cursor.com>
✅ Deploy Preview for dashboard-v2-novu-staging ready!
📝 Walkthrough
Changes: Agent Evals Harness
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 15
🧹 Nitpick comments (4)
libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts (1)
29-29: 💤 Low value. Consider consistent boolean coercion style in predicates.
Line 29 uses `!flags.slackConfigToken` while lines 33 and 42 use `Boolean(flags.slackConfigToken)`. For consistency and idiomatic TypeScript, consider using truthiness directly in all predicates:

```diff
 {
   stdout: 'NOVU_CONNECT_SLACK_AUTHORIZE_URL=https://slack.test/oauth/rerun-token',
-  when: (flags) => Boolean(flags.slackConfigToken),
+  when: (flags) => !!flags.slackConfigToken,
 },
```

```diff
 {
   stdout: [
     '✓ Your agent is live.',
     ' Agent: Slack Rerun Agent (slack-rerun-1)',
     ' → Check Slack — your agent just messaged you.',
     ' Dashboard: https://dashboard.novu.test/agents/slack-rerun-1',
   ].join('\n'),
-  when: (flags) => Boolean(flags.slackConfigToken),
+  when: (flags) => !!flags.slackConfigToken,
 },
```

Also applies to: 33-33, 42-42
libs/agent-evals/src/core/types.ts (1)
7-149: 🏗️ Heavy lift. Backend object contracts should use `interface` in this `.ts` module.
Most exported object-shaped contracts here are declared as `type` aliases. For this backend TypeScript code, please migrate object shapes to `interface` (keep union/function aliases as `type`).
♻️ Example pattern (apply consistently across the file)

```diff
-export type GraderOutcome = {
+export interface GraderOutcome {
   status: GraderResult;
   reason?: string;
-};
+}

-export type ToolCallRecord = {
+export interface ToolCallRecord {
   name: string;
   args: Record<string, unknown>;
   result?: unknown;
   timestamp: number;
-};
+}
```

As per coding guidelines, on the backend, use `interface` for type definitions in `*.ts` files.
Source: Coding guidelines
libs/agent-evals/src/core/run-agent.ts (1)
9-14: ⚡ Quick win. Use `interface` for backend options object shapes.
At Line 9, `RunAgentOptions` is an object `type` alias; prefer `interface` here to match backend conventions.
As per coding guidelines, "`**/*.{ts,tsx}`: On the backend: use `interface` for type definitions; on the frontend: use `type` for type definitions."
Source: Coding guidelines
libs/agent-evals/src/core/tools.ts (1)
17-25: ⚡ Quick win. Use `interface` for backend object type declarations.
At Line 17, `HarnessContext` is declared as a `type` object alias. Repository rules for backend TS favor `interface` for type definitions.
As per coding guidelines, "`**/*.{ts,tsx}`: On the backend: use `interface` for type definitions; on the frontend: use `type` for type definitions."
Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/agent-evals.yml:
- Around line 28-29: The checkout step using actions/checkout@v4 is missing the
persist-credentials security setting. Add the `persist-credentials: false`
option to the checkout step to prevent the GitHub token from persisting in the
workspace and potentially leaking through artifacts or logs. This is a security
best practice to protect credentials from unintended exposure.
- Around line 48-54: The direct expansion of ${{ inputs.enable_judge }} in the
run script creates a script injection vulnerability. Move inputs.enable_judge to
the environment block by adding an env section at the same level as the run
section, setting a variable like ENABLE_JUDGE to ${{ inputs.enable_judge }}.
Then in the conditional check, replace the direct expansion with a reference to
the environment variable instead of ${{ inputs.enable_judge }}, so the value is
safely isolated from shell execution.
- Line 29: The actions/checkout action is referenced using a version tag (`@v4`)
which creates a supply chain security risk. Replace the `@v4` reference with a
specific 40-character commit SHA to pin the action to an immutable version,
preventing unauthorized updates by the action owner.
In `@libs/agent-evals/package.json`:
- Around line 16-27: The dependencies in the package.json file violate the
minimumReleaseAge policy. Update the version constraints for ai, typescript,
`@ai-sdk/anthropic`, and `@types/node` to use stable releases that meet the age
requirement instead of the recent versions currently specified. Check the policy
documentation to determine the exact minimum age threshold, then replace each
problematic dependency version with an older stable release that was published
at least that many days ago. Verify all four packages (ai, typescript,
`@ai-sdk/anthropic`, and `@types/node`) comply with the policy before finalizing the
changes.
In `@libs/agent-evals/README.md`:
- Around line 141-162: The fenced code block in the README.md file is missing a
language label on the opening fence, which triggers the MD040 markdown linting
error. Add a language identifier like `text` to the opening triple backticks of
the code block that displays the directory structure to satisfy markdown linting
requirements.
In `@libs/agent-evals/src/core/graders.ts`:
- Around line 78-87: The definition.run(result) call in the grading loop lacks
error handling, causing any unhandled exception from a single grader to abort
the entire grading process and lose outcomes for all remaining graders. Wrap the
definition.run(result) invocation and the toOutcome conversion in a try-catch
block, capturing any thrown errors. In the catch block, create an appropriate
failed outcome (with status set to something like 'error' or 'failed') and
assign it to outcomes[name], then still invoke the onGraderResult callback so
the error is properly reported, and continue to the next grader instead of
letting the exception propagate up.
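
The per-grader error isolation described above could be sketched as follows. This is a minimal illustration, not the harness's code: the `GraderResult` values, the `runGraders` name, and the synchronous `run` signature are assumptions (an async variant would `await` inside the `try`).

```typescript
// Hypothetical status values; the real GraderResult union may differ.
type GraderResult = "pass" | "fail" | "error";

interface GraderOutcome {
  status: GraderResult;
  reason?: string;
}

interface GraderDefinition<TResult> {
  run: (result: TResult) => GraderOutcome;
}

// Runs every grader; a throwing grader becomes an "error" outcome
// instead of aborting the loop and losing the remaining outcomes.
function runGraders<TResult>(
  graders: Record<string, GraderDefinition<TResult>>,
  result: TResult,
  onGraderResult?: (name: string, outcome: GraderOutcome) => void,
): Record<string, GraderOutcome> {
  const outcomes: Record<string, GraderOutcome> = {};

  for (const [name, definition] of Object.entries(graders)) {
    let outcome: GraderOutcome;
    try {
      outcome = definition.run(result);
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      outcome = { status: "error", reason: `Grader threw: ${message}` };
    }
    outcomes[name] = outcome;
    // The callback still fires for failed graders so the error is reported.
    onGraderResult?.(name, outcome);
  }

  return outcomes;
}
```

With this shape, one misbehaving grader shows up as a single `error` entry while every other grader still reports.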
In `@libs/agent-evals/src/core/mock-shell.ts`:
- Around line 39-57: The parser.parse method call on line 41 can throw an error
that escapes the execute method and crashes the scenario. Wrap the
this.parser.parse(command, env) call in a try-catch block to catch any thrown
errors and handle them gracefully by setting chunks to an error message and
exitCode to 1, similar to how validationError is handled. Additionally, replace
the implicit truthiness checks for the parsed variable in the conditional
statements (the conditions checking isTracked && parsed and isTracked &&
!this.scenario.tape) with explicit null checks using parsed !== null to ensure
proper handling of the parsed value.
In `@libs/agent-evals/src/core/recorder.ts`:
- Around line 87-89: The isKillCommand function's regex pattern matches the
kill-related keywords anywhere in the command string, causing false positives
like 'echo kill' to be misclassified as kill commands. Modify the regex pattern
in isKillCommand to anchor it to the beginning of the string so it only matches
when kill, pkill, or killall is the actual command being invoked, not a
substring elsewhere in the command. Use a pattern that starts with a
beginning-of-string anchor and optionally matches leading whitespace before the
command keywords.
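
A minimal sketch of the anchored pattern suggested above, assuming `isKillCommand` takes the raw command string. The exact regex in `recorder.ts` is not shown in this review, so the pattern here is illustrative only.

```typescript
// Matches only when kill, pkill, or killall is the first word of the command,
// optionally preceded by whitespace. The trailing \b rejects words like "killer".
const KILL_COMMAND = /^\s*(?:kill|pkill|killall)\b/;

function isKillCommand(command: string): boolean {
  return KILL_COMMAND.test(command);
}
```

Note that this anchored form intentionally ignores a kill that appears later in a composite command (for example after `&&`) or behind a wrapper such as `sudo`; whether those should count is a separate design decision.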
In `@libs/agent-evals/src/core/tools.ts`:
- Around line 59-65: The regex pattern at line 59 that matches export statements
only captures the export prefix of composite bash commands like `export X='1' &&
npx novu connect`. When a match is found, the function returns true immediately
without processing any tail commands that follow the `&&` operator, causing
composite commands to be treated as no-ops. To fix this, after matching the
export pattern and setting the environment variable, check if the original
command contains a `&&` operator followed by additional commands. If it does,
extract and return the tail command (the part after `&&`) instead of returning
true, so that the remaining command can be properly processed in subsequent
iterations or execution steps.
- Around line 176-183: The sentinel file reading at line 181 uses fs.readFile
directly to read the matched path without any safety checks, which can allow
access to files outside the workspace fixtures. Replace the
fs.readFile(match[1], 'utf8') call with readFixtureFile(...) function to ensure
the file path is validated and restricted to the intended fixture directory
before reading, maintaining workspace safety boundaries when extracting URLs
from the sentinel file contents.
- Around line 161-168: The recordPoll method is being called before validating
that the shellId is valid, allowing invalid shell IDs to mutate the run state.
Move the context.recorder.recordPoll(shellId) call to execute only after the
shell validation check (after the if (!shell) guard clause that returns early),
ensuring that polling is only recorded for valid shell IDs.
- Around line 51-53: The path containment check at line 51 using
`absolutePath.startsWith(projectRoot)` is insufficient for security because
string prefix matching does not account for path boundaries. A path like
`/root/proj-evil/file` would incorrectly pass the check if projectRoot is
`/root/proj`. Fix this by ensuring proper path boundary validation, such as
verifying that projectRoot ends with a path separator before checking
containment, or by using path normalization and comparison that respects
directory boundaries instead of simple string prefix matching.
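
For the path containment item at the end of the list above, a boundary-aware check might look like the sketch below. It assumes both arguments are already resolved, absolute, POSIX-style paths; `isInsideRoot` is a hypothetical helper name, not a function from `tools.ts`.

```typescript
// Returns true when absolutePath is projectRoot itself or lies beneath it.
// Appending a separator to the root stops "/root/proj-evil" from passing
// a check against "/root/proj".
function isInsideRoot(projectRoot: string, absolutePath: string): boolean {
  const rootWithSeparator = projectRoot.endsWith("/") ? projectRoot : `${projectRoot}/`;
  return absolutePath === projectRoot || absolutePath.startsWith(rootWithSeparator);
}
```

A common alternative in Node is to compute `path.relative(projectRoot, absolutePath)` and reject results that start with `..` or are absolute, which also copes with platform-specific separators.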
In `@libs/agent-evals/src/core/types.ts`:
- Around line 155-156: The normalizePath function processes replacements in the
wrong order, causing Windows-style paths with `.\` prefixes to not be properly
normalized. The current implementation attempts to remove the leading `./`
before converting backslashes to forward slashes, so a path like `.\foo\bar`
keeps its `./` prefix after the backslashes are converted. Swap the order of the
two replace operations in the normalizePath function so that
backslash-to-forward-slash conversion happens first, then remove the leading
`./` prefix.
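
The reordering described above can be sketched like this, assuming `normalizePath` is a pure string-to-string helper built from two `replace` calls (the actual body in `types.ts` is not shown in this review):

```typescript
// Convert backslashes first so that a Windows-style ".\" prefix becomes "./"
// before the leading "./" is stripped.
function normalizePath(input: string): string {
  return input.replace(/\\/g, "/").replace(/^\.\//, "");
}
```

With the operations in the original order, `.\foo\bar` would come out as `./foo/bar`, because the prefix check runs while the path still starts with `.\`.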
In `@libs/agent-evals/src/index.ts`:
- Around line 20-22: The parseArgs function has two issues: first, when parsing
flag values like --suite, --judge, and others (at lines 20-22, 26-28, 32-34,
48-50, 69-71), it accesses argv[index + 1] without checking if the value exists
or if it's another flag starting with '-'. Second, the --fail-under flag parsing
(at lines 116-121) accepts NaN values which bypass the numeric comparison gate
because NaN comparisons always return false. Fix this by adding bounds checking
before accessing argv[index + 1] and validating that the next argument is not a
flag, and for --fail-under specifically, validate that the parsed number is not
NaN using Number.isNaN() and reject invalid input appropriately.
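
The two checks described above might be factored into small helpers like the following. `readFlagValue` and `parseFailUnder` are hypothetical names for illustration; the real `parseArgs` may be structured differently.

```typescript
// Returns the value following argv[index], or throws when it is missing
// or looks like another flag.
function readFlagValue(argv: readonly string[], index: number): string {
  const flag = argv[index];
  const value = argv[index + 1];
  if (value === undefined || value.startsWith("-")) {
    throw new Error(`Missing value for ${flag}`);
  }
  return value;
}

// Parses the --fail-under threshold and rejects non-numeric input.
// Number("") is 0 rather than NaN, so empty input is rejected explicitly.
function parseFailUnder(raw: string): number {
  const parsed = Number(raw);
  if (raw.trim() === "" || Number.isNaN(parsed)) {
    throw new Error(`--fail-under expects a number, got "${raw}"`);
  }
  return parsed;
}
```

One trade-off: treating any value that starts with `-` as a flag also rejects negative numbers, which is acceptable only if no flag legitimately takes one.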
In
`@libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/scenario.ts`:
- Line 6: The scenario currently uses a single qrPath variable pointing to
telegram-setup-qr.png for both NOVU_CONNECT_TELEGRAM_SETUP_QR_PNG and
NOVU_CONNECT_TELEGRAM_DEEPLINK_QR_PNG environment variables. Since these
represent distinct QR codes with different handlers in production, create a
separate deeplink QR fixture file named telegram-deeplink-qr.png in the project
fixture directory, then create a separate variable for the deeplink QR path and
use it to set NOVU_CONNECT_TELEGRAM_DEEPLINK_QR_PNG instead of reusing qrPath.
---
Nitpick comments:
In `@libs/agent-evals/src/core/run-agent.ts`:
- Around line 9-14: The RunAgentOptions definition at line 9 is using the `type`
keyword but backend code conventions require `interface` for type definitions of
object shapes. Convert the `export type RunAgentOptions<TParsed =
ParsedCommand>` declaration to use `interface` instead, maintaining the same
generic parameter and all properties (suite, scenario, model, maxSteps) with
their existing types and optional modifiers.
In `@libs/agent-evals/src/core/tools.ts`:
- Around line 17-25: The HarnessContext type declaration at line 17 uses the
`type` keyword instead of `interface`, which violates backend coding guidelines
that require `interface` for type definitions. Convert the HarnessContext type
alias to an interface by replacing the `type HarnessContext<TParsed =
ParsedCommand> = {` syntax with `interface HarnessContext<TParsed =
ParsedCommand> {` and ensure the closing brace maintains the same structure.
In `@libs/agent-evals/src/core/types.ts`:
- Around line 7-149: Migrate object-shaped type contracts in this file from type
aliases to interface declarations. Convert all object shapes including
GraderOutcome, GraderDefinition, ToolCallRecord, ParsedCommand, TapeChunk, Tape,
ScriptedAnswer, EvalScenario, RunResult, ScenarioScore, RunnerOptions,
MockShellState, CommandParser, RegisteredScenario, and Suite to use the
interface keyword instead of type. Keep the GraderFn as a type alias since it
defines a function signature. Preserve all generic parameters and property
definitions exactly as they are, only changing the declaration syntax from type
to interface.
In
`@libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts`:
- Line 29: The predicate logic in the scenario conditions uses inconsistent
boolean coercion styles: line 29 uses negation with `!flags.slackConfigToken`,
while lines 33 and 42 use explicit Boolean coercion with
`Boolean(flags.slackConfigToken)`. Standardize all three predicate checks (at
lines 29, 33, and 42) to use consistent truthiness checking. Convert them all to
use direct boolean values without explicit Boolean() wrapping or negation
operators, relying on JavaScript's natural truthiness evaluation for cleaner,
more idiomatic TypeScript code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4e426a29-a422-465c-ab13-f7710c385864
⛔ Files ignored due to path filters (2)
- libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/telegram-setup-qr.png is excluded by !**/*.png
- pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (62)
- .github/workflows/agent-evals.yml
- libs/agent-evals/.env.example
- libs/agent-evals/.gitignore
- libs/agent-evals/README.md
- libs/agent-evals/package.json
- libs/agent-evals/project.json
- libs/agent-evals/scripts/run-evals.sh
- libs/agent-evals/src/core/graders.ts
- libs/agent-evals/src/core/judge.ts
- libs/agent-evals/src/core/mock-shell.ts
- libs/agent-evals/src/core/recorder.ts
- libs/agent-evals/src/core/reporters.ts
- libs/agent-evals/src/core/resolve-package-file.ts
- libs/agent-evals/src/core/run-agent.ts
- libs/agent-evals/src/core/runner.ts
- libs/agent-evals/src/core/tools.ts
- libs/agent-evals/src/core/types.ts
- libs/agent-evals/src/index.ts
- libs/agent-evals/src/load-env.ts
- libs/agent-evals/src/self-test.ts
- libs/agent-evals/src/suites/agent-onboarding/catalog.ts
- libs/agent-evals/src/suites/agent-onboarding/connect-parser.ts
- libs/agent-evals/src/suites/agent-onboarding/index.ts
- libs/agent-evals/src/suites/agent-onboarding/kit.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/README.md
- libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/novu-connect-auth-url.txt
- libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/project/package.json
- libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/project/README.md
- libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/project/package.json
- libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/project/README.md
- libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/project/package.json
- libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/project/README.md
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/project/package.json
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/project/README.md
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/project/package.json
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/project/README.md
- libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/project/package.json
- libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/README.md
- libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/novu-connect-auth-url.txt
- libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/project/package.json
- libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/README.md
- libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/project/package.json
- libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/tape.ts
- libs/agent-evals/src/suites/registry.ts
- libs/agent-evals/tsconfig.json
- packages/shared/package.json
…ing system
- Updated the agent-evals harness to utilize vitest for running evaluations.
- Introduced new environment variables for LLM judge configuration.
- Removed legacy CLI entry point and refactored grading logic to improve clarity and maintainability.
- Enhanced grader definitions with human-readable labels for better reporting.
- Updated workflows to reflect changes in evaluation execution.
Co-authored-by: Cursor <cursoragent@cursor.com>
…RL extraction
- Updated the connect command to utilize dashboard OAuth by omitting the `--keyless` flag.
- Enhanced URL extraction functionality to include mailto links.
- Improved grading logic to ensure proper validation of dashboard OAuth usage.
- Refactored scenarios and documentation to reflect changes in onboarding requirements and best practices.
Co-authored-by: Cursor <cursoragent@cursor.com>
…e triage and shell command handling
- Added a section in the README for triaging failing scenarios using the `triage-agent-eval-failures` skill.
- Introduced a new function `readShellValue` to improve parsing of shell command values with proper handling of quotes and escapes.
- Updated `captureLeadingExports` to capture environment variables from shell commands more effectively.
- Modified grading logic in the `discipline-no-timers` scenario to count actual BashOutput poll calls for accurate evaluation.
Co-authored-by: Cursor <cursoragent@cursor.com>
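
The last commit above mentions capturing leading `export` assignments from a shell command. A rough sketch of that idea follows. `splitLeadingExports` and its regex are hypothetical and are not the harness's `captureLeadingExports` or `readShellValue`; in particular, escape sequences inside double quotes are not handled here.

```typescript
interface LeadingExports {
  env: Record<string, string>;
  rest: string;
}

// One leading `export NAME=value` segment, optionally followed by `&&`.
// The value may be single-quoted, double-quoted (no escapes), or a bare word.
const EXPORT_SEGMENT =
  /^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(?:'([^']*)'|"([^"]*)"|([^\s&;]*))\s*(?:&&\s*|$)/;

// Peels `export NAME=value &&` prefixes off a command, returning the captured
// variables and the remaining tail so the tail can still be executed.
function splitLeadingExports(command: string): LeadingExports {
  const env: Record<string, string> = {};
  let rest = command;

  for (;;) {
    const match = EXPORT_SEGMENT.exec(rest);
    if (!match) break;
    const [segment, name, single, double, bare] = match;
    env[name] = single ?? double ?? bare ?? "";
    rest = rest.slice(segment.length);
  }

  return { env, rest: rest.trim() };
}
```

For a composite command such as `export X='1' && npx novu connect`, this returns the captured variable and leaves `npx novu connect` as the tail, rather than treating the whole line as a no-op.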
Actionable comments posted: 2
♻️ Duplicate comments (3)
libs/agent-evals/src/core/tools.ts (2)
236-243: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win. Record polls only after shell ID validation.
Line 237 mutates poll state before the `shellId` existence check (Line 241), so invalid IDs still affect run metrics and grader outcomes.
🐛 Proposed fix

```diff
 execute: async ({ shellId }) => {
   context.recorder.recordToolCall('BashOutput', { shellId });
-  context.recorder.recordPoll(shellId);
   const shell = context.engine.pollShell(shellId);
   if (!shell) {
     return { error: `Unknown shell id: ${shellId}`, stdout: '', completed: true, exitCode: 1 };
   }
+  context.recorder.recordPoll(shellId);
```
251-257: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win. Sentinel-file reads bypass fixture path safety checks.
Line 256 reads a captured path directly from shell output, bypassing fixture-root validation and allowing out-of-scope file reads.
🔒 Proposed fix

```diff
 if (match?.[1]) {
   try {
-    const fileContents = await fs.readFile(match[1], 'utf8');
+    const fileContents = await readFixtureFile(context.scenario.projectRoot, match[1]);
     for (const url of extractUrls(fileContents)) {
       context.recorder.recordUrl(url);
     }
```

libs/agent-evals/src/core/mock-shell.ts (1)
32-53: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win. Harden tracked-command parsing to avoid scenario crashes.
Line 36 can throw from `parser.parse(...)`, which currently escapes `createShell` and can abort the whole eval run. Also, Line 41 should use `parsed !== null` (not truthiness).
🐛 Proposed fix

```diff
 createShell(command: string, runInBackground: boolean, env: Record<string, string>): MockShellState<TParsed> {
   this.shellCounter += 1;
   const id = `shell-${this.shellCounter}`;
   const isTracked = this.parser.matches(command);
-  const parsed = isTracked ? this.parser.parse(command, env) : null;
+  let parsed: TParsed | null = null;
+  let parseError: string | null = null;
+
+  if (isTracked) {
+    try {
+      parsed = this.parser.parse(command, env);
+    } catch (error) {
+      parseError = error instanceof Error ? error.message : String(error);
+    }
+  }

   let chunks: string[] = [];
   let exitCode: number | null = null;

-  if (isTracked && parsed && this.scenario.tape) {
+  if (isTracked && parseError) {
+    chunks = [`✗ Failed to parse tracked command: ${parseError}`];
+    exitCode = 1;
+  } else if (isTracked && parsed !== null && this.scenario.tape) {
     const validationError = this.scenario.tape.validate?.(parsed) ?? null;
```
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@libs/agent-evals/package.json`:
- Around line 8-10: The dependency `@ai-sdk/anthropic`@3.0.10 in your package.json
violates the minimum release age policy as it was published on June 16, 2026
(less than 3 days old). Locate the `@ai-sdk/anthropic` package in the dependencies
or devDependencies section of package.json and either downgrade it to a version
that was published before June 15, 2026, or remove it entirely until a version
that meets the 3-day minimum release age requirement becomes available.
In `@libs/agent-evals/src/suites/agent-onboarding/graders.test.ts`:
- Line 30: The judge.assess() call in the test is using an unsafe type assertion
`as never` to bypass type checking. Instead of casting to `never`, construct a
proper JudgeContext object with all required fields (input, output, metadata,
session, toolCalls, and harness) that the assess method expects. Remove the `as
never` cast and either provide complete context values or refactor the test to
properly satisfy the JudgeContext type requirements.
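
One way to avoid the `as never` cast is a small test factory that fills every required field and accepts overrides. The sketch below is illustrative only: the field names come from the comment above, but every field type, the `ToolCallSummary` shape, and the `makeJudgeContext` name are assumptions, since the real `JudgeContext` definition is not shown in this review.

```typescript
// Hypothetical shapes; the real JudgeContext fields may be typed differently.
interface ToolCallSummary {
  name: string;
  args: Record<string, unknown>;
}

interface JudgeContext {
  input: string;
  output: string;
  metadata: Record<string, unknown>;
  session: { id: string };
  toolCalls: ToolCallSummary[];
  harness: { name: string };
}

// Builds a complete context with neutral defaults; tests override only
// the fields they care about, so no unsafe cast is needed.
function makeJudgeContext(overrides: Partial<JudgeContext> = {}): JudgeContext {
  return {
    input: "",
    output: "",
    metadata: {},
    session: { id: "test-session" },
    toolCalls: [],
    harness: { name: "test-harness" },
    ...overrides,
  };
}
```

A test could then call `judge.assess(makeJudgeContext({ output: "..." }))`, and the compiler would flag the factory if a required field is added later.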
---
Duplicate comments:
In `@libs/agent-evals/src/core/mock-shell.ts`:
- Around line 32-53: The createShell method has two issues that can cause
crashes or incorrect logic: First, the parser.parse() call on line 36 can throw
an exception that escapes the method and terminates the evaluation run, so wrap
this call in a try-catch block to handle parsing errors gracefully (treat
parsing failures as untracked commands or set appropriate error state). Second,
on line 41 the condition checks truthiness of parsed but should use explicit
null comparison with parsed !== null instead, since parsed can be null or an
object value that needs proper type checking.
In `@libs/agent-evals/src/core/tools.ts`:
- Around line 236-243: The recordPoll method is being called on line 237 before
the shell ID is validated on line 241, causing invalid shell IDs to still record
poll metrics. Move the context.recorder.recordPoll(shellId) call to after the
validation check (after the if (!shell) block) so that polls are only recorded
for valid, existing shell IDs.
- Around line 251-257: The file path captured from shell output (match[1]) in
the loop over context.suite.sentinelFilePatterns is being read directly without
validation against the fixture root directory. Validate that the matched file
path (match[1]) is within the fixture root scope before calling fs.readFile.
Resolve the file path against a fixture-root reference and ensure the resolved
path remains within the fixture-root boundaries to prevent out-of-scope file
access.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f1589afb-b40c-4388-9c13-a3175518d094
⛔ Files ignored due to path filters (1)
- pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (41)
- .cursor/skills/triage-agent-eval-failures/SKILL.md
- .cursor/skills/triage-agent-eval-failures/reference.md
- .github/workflows/agent-evals.yml
- libs/agent-evals/.env.example
- libs/agent-evals/.gitignore
- libs/agent-evals/README.md
- libs/agent-evals/package.json
- libs/agent-evals/project.json
- libs/agent-evals/scripts/run-evals.sh
- libs/agent-evals/src/core/graders.ts
- libs/agent-evals/src/core/judge.ts
- libs/agent-evals/src/core/mock-shell.ts
- libs/agent-evals/src/core/recorder.ts
- libs/agent-evals/src/core/tools.ts
- libs/agent-evals/src/core/types.ts
- libs/agent-evals/src/suites/agent-onboarding/adapters.ts
- libs/agent-evals/src/suites/agent-onboarding/catalog.ts
- libs/agent-evals/src/suites/agent-onboarding/connect-parser.ts
- libs/agent-evals/src/suites/agent-onboarding/graders.test.ts
- libs/agent-evals/src/suites/agent-onboarding/harness.ts
- libs/agent-evals/src/suites/agent-onboarding/kit.ts
- libs/agent-evals/src/suites/agent-onboarding/onboarding.eval.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/tape.ts
- libs/agent-evals/vitest.config.ts
- libs/agent-evals/vitest.evals.config.ts
- packages/shared/docs/agent-onboarding.md
💤 Files with no reviewable changes (5)
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/scenario.ts
✅ Files skipped from review due to trivial changes (2)
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-whatsapp-redirect/graders.ts
- libs/agent-evals/.gitignore
🚧 Files skipped from review as they are similar to previous changes (14)
- libs/agent-evals/src/suites/agent-onboarding/kit.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/persona-infra-exclusion/graders.ts
- .github/workflows/agent-evals.yml
- libs/agent-evals/src/suites/agent-onboarding/scenarios/email-handoff/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/keyless-slack-secure/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/telegram-secure-qr/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/graders.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/slack-in-chat-rerun/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/tape.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/scenario.ts
- libs/agent-evals/src/suites/agent-onboarding/scenarios/dashboard-prompt-login/graders.ts
- libs/agent-evals/src/core/recorder.ts
- libs/agent-evals/src/core/types.ts
- libs/agent-evals/src/suites/agent-onboarding/catalog.ts
…tions - Removed unnecessary triggers for push and schedule events. - Updated the evaluation job to always enable the LLM judge and specified the source path for agent onboarding evaluations.
Address review feedback on the eval harness:
- tools: segment-safe fixture-root containment (path.relative), route sentinel-file reads through the same guard, treat unquoted ;/& as shell separators so one-line export+connect commands keep their residual, and only record polls for valid shell ids
- recorder: anchor kill-command detection to command-leading invocations and ignore quoted argument text in the watcher guard (no false "sleep" rejects)
- mock-shell: guard tracked-command parsing so a parser throw fails the shell instead of aborting the scenario
- types: normalize slashes before stripping a leading ./ (Windows .\ paths)
- workflow: pin actions to commit SHAs to satisfy workflow-security-lint
- docs: add language label to fenced block (markdownlint MD040)

Co-authored-by: Cursor <cursoragent@cursor.com>
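The "segment-safe fixture-root containment (path.relative)" item can be illustrated with a minimal sketch. The helper name `resolveInsideRoot` and the error message are assumptions made for illustration, not the harness's actual code. The point is that a plain string-prefix check would accept a sibling directory such as `project-evil` next to `project`, while comparing path segments via `path.relative` does not.

```typescript
import * as path from 'node:path';

// Hypothetical helper (not taken from the PR): resolve `requested` against
// `root` and reject anything that lands outside of it.
function resolveInsideRoot(root: string, requested: string): string {
  const absoluteRoot = path.resolve(root);
  const candidate = path.resolve(absoluteRoot, requested);
  const relative = path.relative(absoluteRoot, candidate);

  // A relative path that is exactly ".." or starts with "../" climbs out of
  // the root. An absolute result (e.g. another drive on Windows) also escapes.
  const escapes =
    relative === '..' || relative.startsWith(`..${path.sep}`) || path.isAbsolute(relative);

  if (escapes) {
    throw new Error(`Path escapes fixture root: ${requested}`);
  }

  return candidate;
}
```

With a root of `/fixtures/project`, a request for `src/index.ts` resolves normally, while `../project-evil/secret.txt` throws even though the resolved path shares the `/fixtures/project` string prefix.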
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
libs/agent-evals/src/core/tools.ts (1)
327-345: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Duplicate tool call recording on successful reads.

`recordToolCall('Read', ...)` is invoked at line 327 (entry) and again at line 339 (after success, with `bytes` metadata). This double-records every successful file read, which can skew graders that count or filter `result.toolCalls` (e.g., `noTimersNoWatchers` iterates tool calls to detect patterns).

Other tools (`Bash`, `BashOutput`, `AskUserQuestion`) record exactly once. Align `Read` with that pattern by recording once with all available metadata.

Proposed fix

```diff
 const Read = tool({
   description: 'Read a file from the project workspace.',
   inputSchema: z.object({
     file_path: z.string(),
   }),
   execute: async ({ file_path: filePath }) => {
-    context.recorder.recordToolCall('Read', { file_path: filePath });
-
     if (filePath.includes('/tmp/') || filePath.endsWith('.log')) {
+      context.recorder.recordToolCall('Read', { file_path: filePath });
+
       return { error: 'Reading log files is discouraged in this flow.' };
     }

     if (filePath.endsWith('.png')) {
+      context.recorder.recordToolCall('Read', { file_path: filePath });
+
       return { content: '[PNG image omitted by harness]' };
     }

     try {
       const content = await readFixtureFile(context.scenario.projectRoot, filePath);
       context.recorder.recordToolCall('Read', { file_path: filePath }, { bytes: content.length });

       return { content };
     } catch (error) {
+      context.recorder.recordToolCall('Read', { file_path: filePath });
+
       return { error: error instanceof Error ? error.message : 'Failed to read file.' };
     }
   },
 });
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/agent-evals/src/core/tools.ts` around lines 327 - 345, The Read tool in the recordToolCall pattern is recording twice for successful file reads - once at the entry point with only file_path and again after successful read with bytes metadata. This causes double-recording that skews graders. Remove the initial recordToolCall invocation at the entry of the function and keep only the single recordToolCall after the readFixtureFile succeeds, which includes both the file_path and bytes metadata. This aligns the Read tool's recording pattern with other tools like Bash, BashOutput, and AskUserQuestion that record exactly once with all available metadata.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@libs/agent-evals/src/core/tools.ts`:
- Around line 327-345: The Read tool in the recordToolCall pattern is recording
twice for successful file reads - once at the entry point with only file_path
and again after successful read with bytes metadata. This causes
double-recording that skews graders. Remove the initial recordToolCall
invocation at the entry of the function and keep only the single recordToolCall
after the readFixtureFile succeeds, which includes both the file_path and bytes
metadata. This aligns the Read tool's recording pattern with other tools like
Bash, BashOutput, and AskUserQuestion that record exactly once with all
available metadata.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8aff5019-cdff-4ff6-9d3e-6ea7650cef68
📒 Files selected for processing (7)
- .github/workflows/agent-evals.yml
- libs/agent-evals/README.md
- libs/agent-evals/src/core/mock-shell.ts
- libs/agent-evals/src/core/recorder.ts
- libs/agent-evals/src/core/tools.ts
- libs/agent-evals/src/core/types.ts
- playground/nextjs/.env.example
✅ Files skipped from review due to trivial changes (2)
- playground/nextjs/.env.example
- libs/agent-evals/README.md
🚧 Files skipped from review as they are similar to previous changes (4)
- .github/workflows/agent-evals.yml
- libs/agent-evals/src/core/recorder.ts
- libs/agent-evals/src/core/mock-shell.ts
- libs/agent-evals/src/core/types.ts
Trigger the eval job when the harness code (libs/agent-evals/**) or the workflow itself changes, not only on the playbook doc, so grader/tape/parser/ mock-shell changes are covered. Intentionally omit the global lockfile to avoid running this LLM-backed, secret-dependent job on every unrelated PR. Co-authored-by: Cursor <cursoragent@cursor.com>
…s fixes NV-8059

Address Greptile re-review:
- catalog: qrHostAware now passes when the agent embeds the QR PNG as an inline Markdown image in chat, not only when it opens it via the OS viewer; both are playbook-approved host-aware delivery paths
- workflow: add nightly schedule + workflow_dispatch triggers and gate NOVU_EVAL_JUDGE to those events so PRs run deterministic graders only, matching the README/PR contract

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the naive quoted-span regex with a single-pass lexer so the shell '\'' apostrophe idiom (e.g. 'Bob'\''s sleep coach') no longer leaks words like sleep/tail/grep to the watcher check and false-fails valid agent descriptions. Unquoted command words are preserved, so real watcher commands are still caught. Co-authored-by: Cursor <cursoragent@cursor.com>
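To make the failure mode concrete, here is a minimal sketch of a single-pass quote/escape lexer. The names `stripQuotedText` and `hasWatcherWord`, and the word list, are hypothetical and not taken from `recorder.ts`. A regex such as `/'[^']*'/g` pairs up the wrong apostrophes in `'Bob'\''s sleep coach'` and leaves `sleep` exposed, whereas tracking quote state character by character does not.

```typescript
// Hypothetical sketch: drop single- and double-quoted spans from a shell
// command, keeping only text that appears outside quotes.
function stripQuotedText(command: string): string {
  let output = '';
  let state: 'none' | 'single' | 'double' = 'none';

  for (let i = 0; i < command.length; i++) {
    const ch = command.charAt(i);

    if (state === 'single') {
      // Inside single quotes nothing is special except the closing quote.
      if (ch === "'") state = 'none';
      continue;
    }

    if (state === 'double') {
      if (ch === '\\') {
        i++; // skip the escaped character, e.g. \" inside double quotes
      } else if (ch === '"') {
        state = 'none';
      }
      continue;
    }

    if (ch === '\\') {
      // Outside quotes a backslash escapes the next character, so \' is a
      // literal apostrophe and must not open a quoted span.
      i++;
      output += command.charAt(i);
      continue;
    }

    if (ch === "'") {
      state = 'single';
      continue;
    }

    if (ch === '"') {
      state = 'double';
      continue;
    }

    output += ch;
  }

  return output;
}

// Assumed watcher word list, for illustration only.
function hasWatcherWord(command: string): boolean {
  return /\b(sleep|tail|grep)\b/.test(stripQuotedText(command));
}
```

Under these assumptions, `npx novu connect 'Bob'\''s sleep coach' --keyless` is not flagged, because `sleep` only appears inside quoted text, while `sleep 5 && npx novu connect 'x'` is still flagged.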
Run the Cursor automation webhook after changes land on next via push, not while the PR is open. Co-authored-by: Cursor <cursoragent@cursor.com>
Revert agent-onboarding-webhook.yml edits; that workflow and its Cursor secrets are outside the agent-evals harness scope. Co-authored-by: Cursor <cursoragent@cursor.com>
Remove the NOVU_EVAL_JUDGE flag and its gating so judge graders run alongside deterministic graders on every run, including CI. Drops the flag from adapters, the eval suite, the workflow, env example, docs, and the triage skill. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: George Djabarov <djabarovgeorge@users.noreply.github.com>
… and pending-shell modeling fixes NV-8059 Co-authored-by: George Djabarov <djabarovgeorge@users.noreply.github.com>
…s NV-8059 Co-authored-by: George Djabarov <djabarovgeorge@users.noreply.github.com>
…V-8059 The conclusionFirstReport judge required the CLI result to be followed directly by the next action, but the playbook mandates a 1-2 sentence recap in between. This caused the grader to fail on every scenario. Relax the prompt to allow the recap and fail only when the result is buried under process narration or no next action is surfaced. Co-authored-by: Cursor <cursoragent@cursor.com>
…ixes NV-8059 The email-handoff, telegram-secure-qr, and persona-infra-exclusion scenarios have no dashboard signal in their user prompt, so per the onboarding playbook the agent must default to `--keyless`. Without `requireKeyless: true` the tape also returns the success chunks for a dashboard-OAuth command, letting an agent that omits `--keyless` pass every grader despite choosing the wrong auth mode. Set `requireKeyless: true` so the tape rejects non-keyless commands, matching the existing keyless-slack-secure scenario (via buildDefaultTape). Co-authored-by: George Djabarov <djabarovgeorge@users.noreply.github.com>
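The effect of `requireKeyless: true` can be sketched as a guard that runs before any tape chunk is replayed. All type and function names below (`ConnectFlagsSketch`, `TapeSketch`, `replayTapeSketch`) and the error text are assumptions for illustration; only the `requireKeyless` option and the `stdout`/`when` chunk shape come from the PR.

```typescript
// Hypothetical shapes, loosely modeled on the tape chunks shown in this PR.
interface ConnectFlagsSketch {
  keyless: boolean;
  channel?: string;
}

interface TapeChunkSketch {
  stdout: string;
  when?: (flags: ConnectFlagsSketch) => boolean;
}

interface TapeSketch {
  requireKeyless?: boolean;
  chunks: TapeChunkSketch[];
}

interface ShellResultSketch {
  exitCode: number;
  stdout: string[];
}

function replayTapeSketch(tape: TapeSketch, flags: ConnectFlagsSketch): ShellResultSketch {
  // Without this guard a dashboard-OAuth command would fall through and
  // receive the success chunks, so every grader could still pass.
  if (tape.requireKeyless && !flags.keyless) {
    return { exitCode: 1, stdout: ['error: this scenario expects --keyless'] };
  }

  // Replay only the chunks whose predicate accepts the parsed flags.
  const stdout = tape.chunks
    .filter((chunk) => chunk.when === undefined || chunk.when(flags))
    .map((chunk) => chunk.stdout);

  return { exitCode: 0, stdout };
}
```

In this sketch a command parsed with `keyless: false` exits with code 1 before any success output is produced, which is what lets the graders observe the wrong auth mode.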
…rness-pr-6c42 Co-authored-by: George Djabarov <djabarovgeorge@users.noreply.github.com>
Summary
- `@novu/agent-evals`: a suite-based behavioral eval harness that runs a real LLM agent against scripted scenarios with a mocked CLI, then grades playbook adherence via deterministic checks and LLM-as-judge graders.
- The `agent-onboarding` suite covers 8 scenarios for `packages/shared/docs/agent-onboarding.md` (the `npx novu connect` flow).
- Exports `@novu/shared/docs/agent-onboarding.md` so the harness resolves the canonical playbook via package export.

Why
Regression-test agent onboarding playbook behavior (connect flags, discipline, persona) before shipping doc or dashboard prompt changes.

Required CI secrets / env
Add in GitHub → Settings → Secrets and variables → Actions before merging:
- `ANTHROPIC_API_KEY` (`agent-evals.yml`, `next`)

Judge graders always run when evals run (no `NOVU_EVAL_JUDGE` toggle). Optional overrides: `NOVU_EVAL_MODEL`, `NOVU_EVAL_JUDGE_MODEL`.

Architecture
```mermaid
flowchart TB
  subgraph entry["Entry (vitest)"]
    Eval["onboarding.eval.ts\ndescribeEval per scenario"]
    Adapters["adapters.ts\ngrader → judge"]
  end
  subgraph core["Core simulation (src/core/)"]
    Harness["harness.ts\ncreateHarness + AI SDK loop"]
    Tools["tools.ts\nBash · BashOutput · AskUserQuestion · Read"]
    MockShell["mock-shell.ts\nTape replay engine"]
    Recorder["recorder.ts\nRunResult builder"]
    Graders["graders.ts\ndefineGraders · contains · judge"]
    Judge["judge.ts\nLLM-as-judge"]
  end
  subgraph suite["Suite (src/suites/agent-onboarding/)"]
    Scenarios["scenarios/{id}/\nscenario.ts · graders.ts · project/"]
    Parser["connect-parser.ts"]
    Tape["tape.ts"]
    Catalog["catalog.ts"]
  end
  Eval --> Harness
  Eval --> Adapters
  Adapters --> Graders
  Adapters --> Judge
  Harness --> Tools
  Tools --> MockShell
  Tools --> Recorder
  Harness --> Recorder
  Scenarios --> Eval
  Parser --> MockShell
  Tape --> MockShell
```

Test plan
- `pnpm --filter @novu/agent-evals check`
- `pnpm --filter @novu/agent-evals test`
- `pnpm --filter @novu/agent-evals eval` with `ANTHROPIC_API_KEY` (runs deterministic + judge graders)
- `pnpm --filter @novu/agent-evals exec vitest run --config vitest.evals.config.ts -t keyless-slack-secure`

Linear: https://linear.app/novu/issue/NV-8059/add-suite-based-behavioral-eval-harness-for-agent-onboarding-playbook
Greptile Summary
This PR introduces `@novu/agent-evals`, a suite-based behavioral eval harness that runs a real LLM agent against 8 scripted onboarding scenarios with a mocked CLI, then grades playbook adherence via deterministic checks and LLM-as-judge graders. It also exports `@novu/shared/docs/agent-onboarding.md` for canonical playbook resolution and adds a path-triggered CI workflow.
- `libs/agent-evals/src/core/`: MockShellEngine with tape-replay, shell-escape-aware export capture, fixture path containment via `path.relative`, and a single-pass quote/escape lexer in the watcher guard. Most correctness issues from prior review rounds have been addressed.
- `src/suites/agent-onboarding/`: 8 scenarios covering keyless, dashboard-OAuth, discipline, and persona paths. The shell-word tokenizer in `connect-parser.ts` now strips quotes and handles positional descriptions anywhere in the command; `buildDefaultTape` now enforces `requireKeyless` by default; three scenarios that were previously missing `requireKeyless: true` have been fixed, but `discipline-no-timers` still omits this guard.
- `apps/api`, `apps/dashboard`, `packages/chat-adapter`, and the playground: formatting only; no logic changes.

Confidence Score: 4/5
Safe to merge with one scenario correctness fix outstanding: discipline-no-timers accepts the wrong auth path.
Most correctness issues from prior review rounds have been addressed — shell-word tokenizer, export capture, path guard, watcher guard, tape keyless enforcement for three scenarios, and the QR host-aware check. One scenario (discipline-no-timers) was missed in the keyless enforcement sweep: its tape accepts a command without --keyless even though the user prompt has no dashboard login, so an agent can pass all four graders on the wrong auth path. The follow-up injection via toolResult.output is also a known unresolved gap, but its impact is isolated to the slack-in-chat-rerun scenario's in-chat token path. The non-eval code changes are pure formatting with no logic impact.
libs/agent-evals/src/suites/agent-onboarding/scenarios/discipline-no-timers/scenario.ts needs requireKeyless: true. libs/agent-evals/src/suites/agent-onboarding/harness.ts — the toolResult.output vs .result discrepancy for follow-up injection remains open.
Important Files Changed
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[onboarding.eval.ts\ndescribeEval per scenario] --> B[scenarioHarness\ngenerateText loop]
    B --> C[createHarnessTools\nBash · BashOutput · AskUserQuestion · Read]
    C --> D{Command type}
    D -->|export VAR=...| E[captureLeadingExports\nstores to context.env]
    D -->|novu connect ...| F[MockShellEngine.createShell]
    D -->|kill/pkill| G[MockShellEngine.killShell]
    D -->|open/xdg-open| H[recorder.recordOpenedFile]
    F --> I[connectParser.parse\ntokenizeShellWords · readFlagValue\nfindConnectPositional]
    I --> J[connectValidate\nrequireKeyless · allowedChannels · --ci]
    J -->|valid| K[selectTapeChunks\npendingWhen check]
    J -->|invalid| L[exitCode=1 error chunk]
    K --> M[pollShell / BashOutput polling]
    M --> N[RunRecorder.build\nRunResult]
    N --> O[Graders\ndeterministic · judge]
    O --> P{pass / fail / skip}
    B -->|follow-up injection| Q[shouldInjectFollowUp\nfollowUpTextPattern OR\ntoolResult.output ⚠️]
```

Reviews (10): Last reviewed commit: "Merge remote-tracking branch 'origin/nex..."
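The `export VAR=...` branch of the command routing can be sketched as follows. This is a simplified illustration, not the harness's `captureLeadingExports`: the function names, the environment variable used in the usage note, and the choice to rejoin the residual with `&&` are all assumptions. It splits a one-line command on unquoted `;`, `&`, or `&&`, records leading `export NAME=value` segments, and returns whatever remains.

```typescript
// Hypothetical sketch: split on unquoted ';', '&', or '&&'. It does not
// understand redirections such as 2>&1 or other shell syntax.
function splitUnquotedSketch(command: string): string[] {
  const parts: string[] = [];
  let current = '';
  let state: 'none' | 'single' | 'double' = 'none';

  for (let i = 0; i < command.length; i++) {
    const ch = command.charAt(i);

    if (state === 'single') {
      current += ch;
      if (ch === "'") state = 'none';
      continue;
    }

    if (state === 'double') {
      current += ch;
      if (ch === '\\') {
        i++;
        current += command.charAt(i); // keep the escaped character verbatim
      } else if (ch === '"') {
        state = 'none';
      }
      continue;
    }

    if (ch === '\\') {
      i++;
      current += ch + command.charAt(i);
      continue;
    }

    if (ch === "'" || ch === '"') {
      state = ch === "'" ? 'single' : 'double';
      current += ch;
      continue;
    }

    if (ch === ';' || ch === '&') {
      if (ch === '&' && command.charAt(i + 1) === '&') i++; // treat && as one separator
      parts.push(current.trim());
      current = '';
      continue;
    }

    current += ch;
  }

  parts.push(current.trim());
  return parts.filter((part) => part.length > 0);
}

function captureLeadingExportsSketch(command: string): {
  env: Record<string, string>;
  residual: string;
} {
  const segments = splitUnquotedSketch(command);
  const env: Record<string, string> = {};
  let index = 0;

  for (; index < segments.length; index++) {
    const match = /^export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$/.exec(segments[index] ?? '');
    if (!match) break; // first non-export segment starts the residual

    const [, name = '', rawValue = ''] = match;
    // Strip one pair of matching surrounding quotes, if present.
    env[name] = rawValue.replace(/^(['"])(.*)\1$/, '$2');
  }

  // Assumption: the residual is rejoined with '&&' regardless of the
  // separators used in the original command.
  return { env, residual: segments.slice(index).join(' && ') };
}
```

For example, `export SLACK_CONFIG_TOKEN='xoxe-123'; npx novu connect --keyless` (a made-up variable name) yields an `env` containing `SLACK_CONFIG_TOKEN` with the value `xoxe-123` and a residual of `npx novu connect --keyless`.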