feat(ai-builder): Add sub-agent evaluation harness with binary checks (no-changelog) #28289

Draft
schrothbn wants to merge 1 commit into master from trust-1-feature-instanceai-sub-agent-evaluation-harness

@schrothbn schrothbn commented Apr 10, 2026

Adds a sub-agent evaluation harness that runs builder sub-agents with a real LLM and stubbed n8n services, then evaluates the resulting workflows using binary checks (pass/fail).

Binary checks (evaluations/binaryChecks/):

  • 19 deterministic checks validating workflow structure: trigger presence, node connectivity, expression references, credential hygiene, unreachable nodes, valid field references, etc.
  • 6 LLM-backed checks (auto-skipped when no model is configured): fulfills user request, valid data flow, correct node operations, descriptive names, etc.
  • Each check produces a Feedback item with score 0 or 1, plus an overall pass rate.
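
The check contract described above could be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the `Workflow`, `Feedback`, and `BinaryCheck` shapes, the check names, and `runBinaryChecks` are all assumptions, and the connection map is deliberately simpler than n8n's real workflow JSON.

```typescript
// Hypothetical, simplified shapes; the PR's actual types are not shown here.
interface WorkflowNode {
  name: string;
  type: string;
}

interface Workflow {
  nodes: WorkflowNode[];
  // Simplified: source node name -> names of nodes it feeds into.
  connections: Record<string, string[]>;
}

interface Feedback {
  check: string;
  score: 0 | 1; // binary: 1 = pass, 0 = fail
}

interface BinaryCheck {
  name: string;
  kind: "deterministic" | "llm";
  run: (workflow: Workflow) => Promise<boolean>;
}

// Deterministic check: at least one node looks like a trigger.
const hasTrigger: BinaryCheck = {
  name: "has_trigger",
  kind: "deterministic",
  run: async (wf) => wf.nodes.some((n) => n.type.toLowerCase().includes("trigger")),
};

// Deterministic check: every node appears in at least one connection.
const allNodesConnected: BinaryCheck = {
  name: "all_nodes_connected",
  kind: "deterministic",
  run: async (wf) => {
    const linked = new Set<string>();
    for (const [source, targets] of Object.entries(wf.connections)) {
      if (targets.length > 0) linked.add(source);
      targets.forEach((t) => linked.add(t));
    }
    return wf.nodes.length <= 1 || wf.nodes.every((n) => linked.has(n.name));
  },
};

// Runs the checks, skipping LLM-backed ones when no model is configured,
// and reports the pass rate over the checks that actually ran.
async function runBinaryChecks(
  workflow: Workflow,
  checks: BinaryCheck[],
  llmConfigured: boolean,
): Promise<{ feedback: Feedback[]; passRate: number }> {
  const feedback: Feedback[] = [];
  for (const check of checks) {
    if (check.kind === "llm" && !llmConfigured) continue;
    const passed = await check.run(workflow);
    feedback.push({ check: check.name, score: passed ? 1 : 0 });
  }
  const passedCount = feedback.filter((f) => f.score === 1).length;
  const passRate = feedback.length === 0 ? 0 : passedCount / feedback.length;
  return { feedback, passRate };
}

// Example input: a trigger wired to an HTTP node, plus one orphaned node.
const sampleWorkflow: Workflow = {
  nodes: [
    { name: "Manual Trigger", type: "n8n-nodes-base.manualTrigger" },
    { name: "HTTP Request", type: "n8n-nodes-base.httpRequest" },
    { name: "Orphan", type: "n8n-nodes-base.set" },
  ],
  connections: { "Manual Trigger": ["HTTP Request"] },
};
```

With `sampleWorkflow`, the trigger check passes and the connectivity check fails because of the orphaned node, so one of two checks passes. Computing the pass rate only over checks that ran keeps skipped LLM checks from counting as failures; whether the harness treats skips this way is an assumption.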

Sub-agent runner (evaluations/subagent/):

  • Instantiates a builder sub-agent with a real LLM and stubbed service context (no running n8n instance needed).
  • Captures the built workflow and runs binary checks against it.
  • CLI (pnpm eval:subagent) supports --filter, --prompt, --model, --verbose, and LangSmith dataset integration (--dataset, --experiment).
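
The idea of running a sub-agent against stubbed services and capturing its output might look something like the sketch below. Every name here (`StubWorkflowService`, `SubAgent`, `runSubAgent`, `fakeAgent`) is hypothetical, and the fake agent stands in for the real LLM-driven builder, which this conversation does not show.

```typescript
// Hypothetical shapes for illustration only.
interface BuiltWorkflow {
  name: string;
  nodes: { name: string; type: string }[];
}

interface WorkflowService {
  save(workflow: BuiltWorkflow): Promise<{ id: string }>;
}

// In-memory stub: records what the agent saves instead of calling n8n.
class StubWorkflowService implements WorkflowService {
  readonly saved: BuiltWorkflow[] = [];

  async save(workflow: BuiltWorkflow): Promise<{ id: string }> {
    this.saved.push(workflow);
    return { id: `stub-${this.saved.length}` };
  }
}

// A sub-agent is modelled as a function of a prompt and its service context.
type SubAgent = (
  prompt: string,
  services: { workflows: WorkflowService },
) => Promise<void>;

// Runs the agent to completion and returns the last workflow it saved, if any.
async function runSubAgent(
  agent: SubAgent,
  prompt: string,
): Promise<BuiltWorkflow | undefined> {
  const workflows = new StubWorkflowService();
  await agent(prompt, { workflows });
  return workflows.saved[workflows.saved.length - 1];
}

// Stand-in for the LLM-backed builder: saves a one-node workflow.
const fakeAgent: SubAgent = async (prompt, services) => {
  await services.workflows.save({
    name: prompt,
    nodes: [{ name: "Manual Trigger", type: "n8n-nodes-base.manualTrigger" }],
  });
};
```

Calling `runSubAgent(fakeAgent, "demo")` resolves to the captured workflow, which could then be handed to the binary checks. Injecting the service context is what lets the run proceed without a live n8n instance.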

Other changes:

  • New eval:subagent and eval:e2e npm scripts in package.json.

Related Linear tickets, GitHub issues, and Community forum posts

https://linear.app/n8n/issue/TRUST-1

Review / Merge checklist

  • I have seen this code, I have run this code, and I take responsibility for this code.
  • PR title and summary are descriptive. (conventions)
  • Docs updated or follow-up ticket created.
  • Tests included.
  • PR Labeled with Backport to Beta, Backport to Stable, or Backport to v1 (if the PR is an urgent fix that needs to be backported)

Add isolated sub-agent runner that instantiates builder agents with
stubbed services, runs them to completion, and evaluates the resulting
workflows. Includes 18 deterministic and 6 LLM binary checks ported
from ai-workflow-builder.ee.

Deterministic checks validate structural correctness (connectivity,
triggers, required params, expressions, AI node wiring, etc.).
LLM checks evaluate semantic quality (fulfills request, correct
operations, data flow, naming, item handling, response accuracy).

Includes LangSmith integration, HTML reporting, CLI entry point,
in-memory sandbox stubs, and credential seeding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot commented Apr 10, 2026

⚠️ PR exceeds size limit (3,273 lines added)

This PR adds 3,273 lines, exceeding the 1,000-line limit.

Large PRs are harder to review and increase the risk of bugs going unnoticed. Please consider:

  • Breaking this into smaller, logically separate PRs
  • Moving unrelated changes to a follow-up PR

If the size is genuinely justified (e.g. generated code, large migrations, test fixtures), a maintainer can override by commenting /size-limit-override and then pushing a new commit or re-running this check.

@schrothbn schrothbn changed the title from "feat(instance-ai): Add sub-agent evaluation harness with binary checks (no-changelog)" to "feat(ai-builder): Add sub-agent evaluation harness with binary checks (no-changelog)" Apr 10, 2026

codecov bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@n8n-assistant n8n-assistant bot added the "n8n team" (Authored by the n8n team) label Apr 10, 2026