feat(ai-builder): Add sub-agent evaluation harness with binary checks (no-changelog) by schrothbn · Pull Request #28289 · n8n-io/n8n

schrothbn · 2026-04-10T07:39:25Z

Adds a sub-agent evaluation harness that runs builder sub-agents with a real LLM and stubbed n8n services, then evaluates the resulting workflows using binary checks (pass/fail).

Binary checks (evaluations/binaryChecks/):

19 deterministic checks validating workflow structure: trigger presence, node connectivity, expression references, credential hygiene, unreachable nodes, valid field references, etc.
6 LLM-backed checks (auto-skipped when no model is configured): fulfills user request, valid data flow, correct node operations, descriptive names, etc.
Each check produces a Feedback item with score 0 or 1, plus an overall pass rate.
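
To illustrate the pass/fail model described above, here is a minimal sketch of two deterministic checks and the pass-rate aggregation. The `Feedback`, `SimpleWorkflow`, and `BinaryCheck` shapes, the check names, and the "type contains `trigger`" heuristic are all assumptions made for this example; the types and checks in evaluations/binaryChecks/ may look different.

```typescript
// Hypothetical shapes for illustration; not taken from the PR's code.
interface Feedback {
  check: string;
  score: 0 | 1;
  comment: string;
}

interface SimpleNode {
  name: string;
  type: string;
}

interface SimpleWorkflow {
  nodes: SimpleNode[];
  // Source node name -> names of the nodes it connects to.
  connections: Record<string, string[]>;
}

type BinaryCheck = (workflow: SimpleWorkflow) => Feedback;

// Assumed heuristic: a node counts as a trigger if its type mentions "trigger".
const isTrigger = (node: SimpleNode): boolean => /trigger/i.test(node.type);

// Passes when the workflow contains at least one trigger node.
const hasTrigger: BinaryCheck = (workflow) => {
  const found = workflow.nodes.some(isTrigger);
  return {
    check: 'has_trigger',
    score: found ? 1 : 0,
    comment: found ? 'trigger found' : 'no trigger node',
  };
};

// Passes when every node can be reached from some trigger (breadth-first walk).
const noUnreachableNodes: BinaryCheck = (workflow) => {
  const queue: string[] = workflow.nodes.filter(isTrigger).map((node) => node.name);
  const seen = new Set<string>(queue);
  while (queue.length > 0) {
    const current = queue.shift();
    if (current === undefined) break;
    for (const next of workflow.connections[current] ?? []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  const unreachable = workflow.nodes
    .filter((node) => !seen.has(node.name))
    .map((node) => node.name);
  return {
    check: 'no_unreachable_nodes',
    score: unreachable.length === 0 ? 1 : 0,
    comment:
      unreachable.length === 0 ? 'all nodes reachable' : `unreachable: ${unreachable.join(', ')}`,
  };
};

// Runs every check and reports the fraction that passed.
function runChecks(
  workflow: SimpleWorkflow,
  checks: BinaryCheck[],
): { feedback: Feedback[]; passRate: number } {
  const feedback = checks.map((check) => check(workflow));
  const passed = feedback.filter((item) => item.score === 1).length;
  return { feedback, passRate: feedback.length === 0 ? 0 : passed / feedback.length };
}

// Example: the Slack node is never connected, so one of the two checks fails.
const example: SimpleWorkflow = {
  nodes: [
    { name: 'Schedule Trigger', type: 'n8n-nodes-base.scheduleTrigger' },
    { name: 'Edit Fields', type: 'n8n-nodes-base.set' },
    { name: 'Slack', type: 'n8n-nodes-base.slack' },
  ],
  connections: { 'Schedule Trigger': ['Edit Fields'] },
};

const report = runChecks(example, [hasTrigger, noUnreachableNodes]);
console.log(report.passRate); // 0.5
```

Keeping each check a pure function of the workflow is what makes the deterministic set cheap to run without a model; an LLM-backed check would presumably need an asynchronous signature instead.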

Sub-agent runner (evaluations/subagent/):

Instantiates a builder sub-agent with a real LLM and stubbed service context (no running n8n instance needed).
Captures the built workflow and runs binary checks against it.
CLI (pnpm eval:subagent) supports --filter, --prompt, --model, --verbose, and LangSmith dataset integration (--dataset, --experiment).
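
The runner idea above (stubbed services in place of a running n8n instance, then capturing what the agent built) can be sketched roughly as follows. Every name here (`WorkflowService`, `InMemoryWorkflowService`, `SubAgent`, `runSubAgent`) is hypothetical, and the scripted `fakeAgent` stands in for the real LLM-driven sub-agent so the example runs offline; it is not the harness's actual API.

```typescript
// Hypothetical types for illustration; not taken from the PR's code.
interface BuiltWorkflow {
  name: string;
  nodes: Array<{ name: string; type: string }>;
}

// The one service this sketch assumes the agent needs: persisting a workflow.
interface WorkflowService {
  save(workflow: BuiltWorkflow): Promise<{ id: string }>;
}

// In-memory stub: records saved workflows instead of calling n8n.
class InMemoryWorkflowService implements WorkflowService {
  readonly saved: BuiltWorkflow[] = [];

  async save(workflow: BuiltWorkflow): Promise<{ id: string }> {
    this.saved.push(workflow);
    return { id: `stub-${this.saved.length}` };
  }
}

type SubAgent = (prompt: string, services: { workflows: WorkflowService }) => Promise<void>;

// Runs the agent to completion against the stub and returns the last saved workflow.
async function runSubAgent(agent: SubAgent, prompt: string): Promise<BuiltWorkflow | undefined> {
  const workflows = new InMemoryWorkflowService();
  await agent(prompt, { workflows });
  return workflows.saved[workflows.saved.length - 1];
}

// Scripted stand-in for the LLM-driven builder, used only to exercise the runner.
const fakeAgent: SubAgent = async (prompt, services) => {
  await services.workflows.save({
    name: `Workflow for: ${prompt}`,
    nodes: [{ name: 'Schedule Trigger', type: 'n8n-nodes-base.scheduleTrigger' }],
  });
};
```

Under these assumptions, the captured workflow returned by `runSubAgent` is what the binary checks would then be run against.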

Other changes:

New eval:subagent and eval:e2e npm scripts in package.json.

Related Linear tickets, GitHub issues, and Community forum posts

https://linear.app/n8n/issue/TRUST-1

Review / Merge checklist

I have seen this code, I have run this code, and I take responsibility for this code.
PR title and summary are descriptive. (conventions)
Docs updated or follow-up ticket created.
Tests included.
PR Labeled with Backport to Beta, Backport to Stable, or Backport to v1 (if the PR is an urgent fix that needs to be backported)

Add isolated sub-agent runner that instantiates builder agents with stubbed services, runs them to completion, and evaluates the resulting workflows.

Includes 18 deterministic and 6 LLM binary checks ported from ai-workflow-builder.ee. Deterministic checks validate structural correctness (connectivity, triggers, required params, expressions, AI node wiring, etc.). LLM checks evaluate semantic quality (fulfills request, correct operations, data flow, naming, item handling, response accuracy).

Includes LangSmith integration, HTML reporting, CLI entry point, in-memory sandbox stubs, and credential seeding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-10T07:39:59Z

⚠️ PR exceeds size limit (3,273 lines added)

This PR adds 3,273 lines, exceeding the 1,000-line limit.

Large PRs are harder to review and increase the risk of bugs going unnoticed. Please consider:

Breaking this into smaller, logically separate PRs
Moving unrelated changes to a follow-up PR

If the size is genuinely justified (e.g. generated code, large migrations, test fixtures), a maintainer can override by commenting /size-limit-override and then pushing a new commit or re-running this check.

codecov · 2026-04-10T07:42:30Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


schrothbn changed the title ~~feat(instance-ai): Add sub-agent evaluation harness with binary checks (no-changelog)~~ feat(ai-builder): Add sub-agent evaluation harness with binary checks (no-changelog) Apr 10, 2026

n8n-assistant bot added the "n8n team" label (Authored by the n8n team) Apr 10, 2026

schrothbn wants to merge 1 commit into master from trust-1-feature-instanceai-sub-agent-evaluation-harness
