refactor(ai-builder): Implement unified evaluations harness #23955
base: master
Conversation
Add trace filtering to avoid 403 errors from oversized LangSmith payloads during concurrent evaluations. Large fields like cachedTemplates and parsedNodeTypes are summarized while preserving essential debugging info.

Changes:
- Add trace-filters.ts with hideInputs/hideOutputs filtering logic
- Add resetFilteringStats() to ensure accurate per-run statistics
- Add TRACE_BATCH_SIZE_LIMIT and TRACE_BATCH_CONCURRENCY constants
- Pass custom LangSmith client to evaluate() calls

Filtering is enabled by default. Set LANGSMITH_MINIMAL_TRACING=false to disable and get full traces.
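A minimal sketch of the hook this enables, assuming the langsmith JS Client's hideInputs/hideOutputs options; the summarization shown is illustrative, the real logic lives in trace-filters.ts:

```ts
import { Client } from 'langsmith';

// Illustrative only: summarize the heavy fields, keep everything else as-is.
function summarizeHeavyFields(payload: Record<string, unknown>): Record<string, unknown> {
	const { cachedTemplates, parsedNodeTypes, ...rest } = payload;
	return {
		...rest,
		...(cachedTemplates !== undefined ? { cachedTemplates: '[summarized]' } : {}),
		...(parsedNodeTypes !== undefined ? { parsedNodeTypes: '[summarized]' } : {}),
	};
}

const lsClient = new Client({
	hideInputs: summarizeHeavyFields,
	hideOutputs: summarizeHeavyFields,
});
```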
Replace global state with closure-scoped state per client instance. This avoids issues with parallel evaluations corrupting shared counters.
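A sketch of the closure-scoped pattern described above (names are illustrative; the real factory is createTraceFilters):

```ts
// Each client gets its own counters; nothing is shared at module scope.
export function createTraceFilters() {
	let inputsFiltered = 0;

	return {
		hideInputs(inputs: Record<string, unknown>): Record<string, unknown> {
			inputsFiltered += 1;
			return inputs; // actual filtering elided in this sketch
		},
		resetFilteringStats(): void {
			inputsFiltered = 0;
		},
		getFilteringStats: () => ({ inputsFiltered }),
	};
}
```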
Pass EvalLogger through setupTestEnvironment to createTraceFilters for consistent logging across the evaluation system.
Extract helper functions to reduce cyclomatic complexity: - summarizeContextField() for context field placeholder strings - filterWorkflowContext() for workflowContext object filtering - summarizeLargeWorkflow() for conditional workflow summarization - trackInputPassthrough() for stat tracking on unchanged inputs
- Add verbose parameter to runLangsmithEvaluation() - Use EvalLogger for consistent logging output - Pass logger to setupTestEnvironment() for trace filter logging - Document CLI options for eval:langsmith in README
Add detailed verbose output showing: - Judge details: individual verdicts with brief justifications - Timing breakdown: generation time vs judge time with averages - Workflow summary: compact node type listing (e.g. "5 nodes (Webhook, IF, HTTP Request x2)") Works for both local pairwise (--prompt) and LangSmith pairwise modes.
Add --verbose flag to local CLI evaluation (pnpm eval) showing: - Per-test results as they complete (PASS/WARN/ERROR with score) - Generation timing - Workflow summary (node types) - Key category scores (functionality, connections, config) - Critical issues if any In verbose mode, progress bar is replaced with real-time test output.
- Add per-example result logging with prompt, scores, and pass/fail status - Add summary statistics (pass rate, average scores) at end of evaluation - Add dataset stats and model info in verbose mode - Enhance trace filtering: add messages array summarization - Reduce batch size limit to 2MB and add batchSizeLimit option - Add input field truncation for large LangChain model inputs
Upgrade to get: - batchSizeLimit option for controlling runs per batch - Better multipart upload handling (may fix 403 errors) - Memory leak fixes and async improvements - omitTracedRuntimeInfo option for smaller payloads
- Add unified argument parser for all CLI flags
- Add ordered progress reporter for real-time verbose logging
- Add abstract runner base class with template method pattern
- Create LLM-judge runner that moves evaluation INTO target function (fixes "Run not created by target function" error in LangSmith 0.4.x)
- Create pairwise runner using new architecture
- Update index.ts to use new unified runners

Key fix: LangSmith 0.4.x requires all LLM calls to happen inside the traceable target function. The old evaluator was calling LLM chains outside the traceable context, causing 403 errors.
- Add core interfaces (Evaluator, Feedback, RunConfig, Lifecycle)
- Add runner supporting both local and LangSmith modes
- Add console lifecycle for progress reporting
- Add comprehensive test coverage (55 tests passing)

Key design decisions:
- Factory pattern for evaluator creation
- Pre-computed feedback pattern for LangSmith compatibility
- Parallel evaluators, sequential examples
- Skip and continue on errors
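An illustrative sketch of the factory pattern mentioned above; the actual interfaces live in the harness types, and the names below are assumptions:

```ts
type Feedback = {
	evaluator: string;
	metric: string;
	score: number; // 0..1
	kind: 'score' | 'metric' | 'detail';
	comment?: string;
};

type Evaluator = (workflow: unknown, context: Record<string, unknown>) => Promise<Feedback[]>;

// Factory: capture configuration once, return a backend-agnostic evaluator.
function createExampleEvaluator(options: { id: string }): Evaluator {
	return async (_workflow, _context) => [
		{ evaluator: options.id, metric: 'overallScore', score: 1, kind: 'score' },
	];
}
```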
- LLM-judge evaluator wraps existing evaluateWorkflow chain - Programmatic evaluator wraps rule-based checks - Add index files for module exports - All 60 tests passing Factory pattern enables: - Easy composition of evaluators - Parallel execution - Centralized error handling via runner
- Create runV2Evaluation() function that ties together: - Environment setup - Workflow generator - Evaluator factories - Console lifecycle - Local and LangSmith modes - Demonstrates full v2 harness usage - 60 tests passing Signed-off-by: Oleg Ivaniv <[email protected]>
- Add pairwise evaluator factory wrapping runJudgePanel() - Add 9 tests for pairwise evaluator (TDD) - Fix prompt not being passed to LLM-judge evaluator context - Fix LangSmith dataset format (support messages[] array) - Improve verbose output with critical metrics and violations - Remove truncation from violation output - Add README.md with mental model and documentation
- Add CLI tests (19 tests) covering loadTestCases, mode selection, config building, exit codes, and workflow generator setup - Add programmatic evaluator tests (9 tests) covering all feedback categories and violation formatting - Expand lifecycle tests (28 additional tests) for verbose output, critical metrics display, violations, and merge functions Coverage improvement: - cli.ts: 0% → 76% - programmatic.ts: 0% → 100% - lifecycle.ts: 66% → 100% - Overall v2/: 63% → 89%
…e filtering to v2 Artifact saving: - Add createArtifactSaver() for persisting evaluation results to disk - Save prompt.txt, workflow.json, feedback.json per example - Save summary.json with per-evaluator statistics - 13 tests for output module Similarity evaluator: - Add createSimilarityEvaluator() wrapping Python graph edit distance - Support single and multiple reference workflows - Support preset configurations (strict/standard/lenient) - 11 tests for similarity evaluator Trace filtering: - Integrate trace filtering into LangSmith mode (enabled by default) - Add enableTraceFiltering option to LangsmithOptions - Re-export createTraceFilters and isMinimalTracingEnabled Coverage: 90% statements, 80% branches, 140 tests passing
Add utilities for analyzing LLM token cache performance: - calculateCacheStats() - compute stats from token usage metadata - aggregateCacheStats() - aggregate multiple stats with correct hit rate - formatCacheStats() - format for display with locale strings 14 tests for cache analyzer module.
Add ability to generate multiple workflows per prompt and aggregate
results for variance reduction in pairwise evaluation.
New features:
- createPairwiseEvaluator({ numGenerations: N }) - generate N workflows
- aggregateGenerations() - calculate generation correctness
- getMajorityThreshold() - majority voting utility
- CLI support via --generations flag
Feedback keys for multi-gen:
- pairwise.generationCorrectness (passing/total)
- pairwise.aggregatedDiagnostic (avg score)
- pairwise.genN.majorityPass/diagnosticScore (per-gen details)
19 new tests for multi-gen functionality.
…se generator to v2 - Add score-calculator.ts with weighted scoring and evaluator grouping - Add report-generator.ts for markdown report generation - Add test-case-generator.ts with LLM-based generation and basicTestCases - Fix --max-examples flag for LangSmith mode in runner-base.ts - Export all new utilities from v2/index.ts 62 new tests added (235 total for v2)
…aceable()

The LangSmith SDK's evaluate() function checks if the target is already wrapped with traceable(). When it is, the SDK skips applying critical defaultOptions (on_end, reference_example_id, client), causing issues like:
- Target function executing multiple times per example
- Missing traces in LangSmith dashboard
- Client mismatch between evaluate() and inner traceable() calls

The fix:
- Do NOT wrap target function with traceable() - let evaluate() handle it
- DO wrap inner operations (like generateWorkflow) with traceable() for child trace visibility

Both v2 runner and pairwise generator now follow this harmonized pattern.
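A minimal sketch of the harmonized pattern using the langsmith JS SDK's evaluate() and traceable(); the dataset name and generator body are placeholders:

```ts
import { Client } from 'langsmith';
import { evaluate } from 'langsmith/evaluation';
import { traceable } from 'langsmith/traceable';

const lsClient = new Client();

// Inner operation IS wrapped, so it shows up as a child trace on the same client.
const generateWorkflow = traceable(
	async (prompt: string) => {
		// ...call the workflow builder agent here...
		return { nodes: [], connections: {} };
	},
	{ name: 'generateWorkflow', client: lsClient },
);

// The target passed to evaluate() is NOT wrapped with traceable();
// evaluate() wraps it itself and attaches reference_example_id, on_end, and the client.
await evaluate(
	async (inputs: Record<string, unknown>) => await generateWorkflow(String(inputs.prompt)),
	{
		data: 'ai-builder-evals', // placeholder dataset name
		client: lsClient,
	},
);
```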
- Add trace flushing with awaitPendingTraceBatches() to ensure all traces are sent before process exits
- Fix exit code in v2 CLI: LangSmith mode now returns 0 on success since results are in the dashboard, not the placeholder summary
- Use limit parameter on listExamples() instead of fetching all and slicing to avoid double target execution
- Add --dataset CLI flag for specifying LangSmith dataset name
- Add maxExamples option to LangsmithOptions type
- Fix message type detection in trace-filters to avoid errors with unusual message objects
Add comprehensive documentation about LangSmith SDK interactions: - Root cause of traceable() + evaluate() conflicts - Correct pattern: don't wrap target, do wrap inner operations - Environment variables required for tracing - Trace flushing importance - Payload size filtering tips - numRepetitions behavior - AsyncLocalStorage context tracking - Client consistency requirements - Debugging tips and SDK source location Also updates CLI usage examples and file structure documentation.
Signed-off-by: Oleg Ivaniv <[email protected]>
… mode - Add logger to RunConfig and pass it from CLI to runner - Replace console.log calls in runner.ts with proper logger methods: - Per-workflow progress → logger.verbose() (only shown with --verbose) - Important status messages → logger.info() (always shown) - Remove [v2] prefix from log messages (no longer needed post-migration) - Remove redundant console.warn calls from programmatic-evaluation.ts (error info is already captured in the returned violation result) This enables cleaner output by default while preserving detailed logging via the --verbose flag.
- runner.ts: Handle unknown error type in template literal with proper type narrowing (instanceof Error check)
- trace-filters.ts: Replace unsafe type casts with proper type guards (hasGetTypeMethod, hasTypeProperty, getTypeName) for safe property access
- test-case-generator.ts: Add Zod-based validation with parseTestCasesOutput() to safely type LLM structured output
- test-case-generator.test.ts: Create helper functions with type guards to safely extract data from jest mock calls
- output.test.ts: Use jsonParse<T> from n8n-workflow with proper type interfaces for type-safe JSON parsing in tests

All fixes use runtime type guards instead of type casting to satisfy strict TypeScript lint rules.
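A sketch of the Zod-based validation approach described for parseTestCasesOutput(); the schema fields are assumptions based on the test-case shape used elsewhere in this PR:

```ts
import { z } from 'zod';

const testCaseSchema = z.object({
	id: z.string(),
	prompt: z.string(),
});

const testCasesOutputSchema = z.array(testCaseSchema);

function parseTestCasesOutput(raw: unknown): Array<z.infer<typeof testCaseSchema>> {
	const parsed = testCasesOutputSchema.safeParse(raw);
	if (!parsed.success) {
		throw new Error(`LLM returned malformed test cases: ${parsed.error.message}`);
	}
	return parsed.data;
}
```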
- Require langsmithClient in LangSmith RunConfig and pass it from CLI setupTestEnvironment - Remove runner-side setupTestEnvironment call to avoid re-initialization/config drift - Ensure nested traceable() uses the same client instance via TraceableConfig.client - Update unit tests for new RunConfig contract
- Prevent NaN in verbose logs, artifacts, and reports by averaging finite scores only
- Clarify weighting layers (LLM category vs cross-evaluator) with explicit names
- Tighten LangSmith reference workflow parsing and ignore invalid refs
- Apply timeouts inside p-limit work; document best-effort cancellation
- Add focused tests for NaN handling, ref parsing, and limiter-slot timeouts
- Remove cache/usage reporting from eval CLI and LangSmith evaluator
- Delete unused cache analyzer module and tests
- Keep full message arrays in LangSmith trace filtering (no summarization)
- Cover header order, do/dont aliases, BOM, headerless CSV, and empty-row handling
- Add negative-path assertions for missing/empty files and no valid prompts

Signed-off-by: Oleg Ivaniv <[email protected]>
Move eval harness into clear submodules (cli/harness/judge/langsmith/support) and update scripts/docs/tests accordingly. - Keep python evals under evaluations/programmatic/python unchanged - Remove categorization-eval artifacts (types + prompts example) - Fix load-nodes path resolution and update downstream imports
Refresh evaluations README for the v2 harness. - Add quick start, prerequisites, CSV format, and component map - Clarify Feedback.kind semantics and LangSmith metric key mapping - Remove SDK-internals/debug-only LangSmith gotchas (keep only the critical traceable rule)
13 issues found across 109 files
Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts:57">
P2: Rule violated: **Prefer Typeguards over Type casting**
Use a type guard instead of `as` for type narrowing. The pattern `typeof x === 'object' && x !== null` can be encapsulated in a reusable type guard like `isRecord(value): value is Record<string, unknown>`.</violation>
<violation number="2" location="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts:75">
P2: Rule violated: **Prefer Typeguards over Type casting**
Use a type guard instead of `as` for type narrowing. This is the same pattern as `summarizeWorkflow` - consider creating a shared `isRecord()` type guard.</violation>
<violation number="3" location="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts:141">
P2: Rule violated: **Prefer Typeguards over Type casting**
Use a type guard instead of `as` for type narrowing. Consider creating a type guard function that validates the structure (e.g., `hasNodes(value): value is { nodes?: unknown[] }`).</violation>
<violation number="4" location="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts:202">
P2: Rule violated: **Prefer Typeguards over Type casting**
Use a type guard instead of `as` for type narrowing. The `typeof === 'object'` check validates runtime type but a type guard would properly narrow the TypeScript type.</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/lifecycle.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/lifecycle.ts:306">
P2: Inverted verbose condition: `onEvaluatorError` only logs in non-verbose mode, which is the opposite of other hooks (`onExampleStart`, `onExampleComplete`). Evaluator errors will be silently suppressed in verbose mode.</violation>
<violation number="2" location="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/lifecycle.ts:392">
P2: Rule violated: **Prefer Typeguards over Type casting**
Use a type predicate instead of `as` for type narrowing after filtering. The `filter(Boolean)` pattern doesn't automatically narrow types in TypeScript - use a type guard function instead.</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/evaluators/programmatic/index.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/evaluators/programmatic/index.ts:72">
P2: Missing feedback entries for `nodes` and `credentials` evaluation results. The underlying `programmaticEvaluation` computes and returns both `result.nodes` and `result.credentials` with `.score` and `.violations` properties (same structure as other categories), but these are not included in the feedback array. This means these metric scores won't be visible on dashboards for debugging/analysis, even though they contribute to the overall score.</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/multi-gen.test.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/multi-gen.test.ts:51">
P2: Test description doesn't match assertion: the test says "should return 2 for 2 judges" but the assertion expects 1, not 2.</violation>
<violation number="2" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/multi-gen.test.ts:55">
P2: Test description doesn't match assertion: the test says "should return 3 for 4 judges" but the assertion expects 2, not 3.</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/multi-gen.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/multi-gen.ts:56">
P2: For even numbers of judges, `Math.ceil(numJudges / 2)` returns the threshold for a tie (50%), not a majority (>50%). With 2 judges, this returns 1 (50%), and with 4 judges, it returns 2 (50%). A true majority threshold should use `Math.floor(numJudges / 2) + 1`.</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/cli.test.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/cli.test.ts:401">
P3: This test is redundant with 'should exit with 0 when pass rate >= 70%' above - both use `passed: 7, totalExamples: 10` which is exactly 70%. Consider removing this duplicate test or changing the values to truly test the boundary (e.g., `passed: 70, totalExamples: 100` to make the "exactly 70%" intent clearer).</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/evaluators/pairwise.test.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/evaluators/pairwise.test.ts:497">
P2: Test assertion doesn't verify parallelism as the test name claims. `expect(callTimes).toHaveLength(3)` only confirms all calls were made, not that they ran in parallel. Consider checking that all timestamps fall within a small window (e.g., all within 10ms of the first call) to verify true parallel execution.</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/trace-filters.test.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/trace-filters.test.ts:27">
P2: Test doesn't verify filtering behavior as claimed. The `cachedTemplates` input only contains `templateId` and `name` - the exact properties that `summarizeCachedTemplates` preserves. Consider adding extra properties (e.g., `workflow: { nodes: [] }`) to the template object to verify they are actually filtered out.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
if (!workflow || typeof workflow !== 'object') {
	return workflow;
}
const wf = workflow as { nodes?: unknown[] };
P2: Rule violated: Prefer Typeguards over Type casting
Use a type guard instead of as for type narrowing. Consider creating a type guard function that validates the structure (e.g., hasNodes(value): value is { nodes?: unknown[] }).
✅ Addressed in 3ed9be4
function summarizeCachedTemplates(templates: unknown[]): Array<Record<string, unknown>> {
	return templates.map((t) => {
		if (!t || typeof t !== 'object') return { unknown: true };
		const template = t as Record<string, unknown>;
P2: Rule violated: Prefer Typeguards over Type casting
Use a type guard instead of as for type narrowing. This is the same pattern as summarizeWorkflow - consider creating a shared isRecord() type guard.
✅ Addressed in 3ed9be4
	return { unknown: true };
}

const wf = workflow as Record<string, unknown>;
P2: Rule violated: Prefer Typeguards over Type casting
Use a type guard instead of as for type narrowing. The pattern typeof x === 'object' && x !== null can be encapsulated in a reusable type guard like isRecord(value): value is Record<string, unknown>.
✅ Addressed in 3ed9be4
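One possible shape for the shared guards suggested here (illustrative; the actual fix in the PR may differ):

```ts
function isRecord(value: unknown): value is Record<string, unknown> {
	return typeof value === 'object' && value !== null;
}

function hasNodes(value: unknown): value is { nodes?: unknown[] } {
	return isRecord(value) && (value.nodes === undefined || Array.isArray(value.nodes));
}
```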
// Handle workflowContext if present
if (filtered.workflowContext && typeof filtered.workflowContext === 'object') {
	filtered.workflowContext = filterWorkflowContext(
		filtered.workflowContext as Record<string, unknown>,
P2: Rule violated: Prefer Typeguards over Type casting
Use a type guard instead of as for type narrowing. The typeof === 'object' check validates runtime type but a type guard would properly narrow the TypeScript type.
✅ Addressed in 3ed9be4
export function mergeLifecycles(
	...lifecycles: Array<Partial<EvaluationLifecycle> | undefined>
): EvaluationLifecycle {
	const validLifecycles = lifecycles.filter(Boolean) as Array<Partial<EvaluationLifecycle>>;
P2: Rule violated: Prefer Typeguards over Type casting
Use a type predicate instead of as for type narrowing after filtering. The filter(Boolean) pattern doesn't automatically narrow types in TypeScript - use a type guard function instead.
Suggested change:
	const validLifecycles = lifecycles.filter(
		(lc): lc is Partial<EvaluationLifecycle> => Boolean(lc),
	);
✅ Addressed in 3ed9be4
	expect(getMajorityThreshold(5)).toBe(3);
});

it('should return 2 for 2 judges', () => {
P2: Test description doesn't match assertion: the test says "should return 2 for 2 judges" but the assertion expects 1, not 2.
Suggested change:
	it('should return 1 for 2 judges', () => {
✅ Addressed in 3ed9be4
if (!Number.isFinite(numJudges) || numJudges < 1) {
	throw new Error(`numJudges must be >= 1 (received ${String(numJudges)})`);
}
return Math.ceil(numJudges / 2);
P2: For even numbers of judges, Math.ceil(numJudges / 2) returns the threshold for a tie (50%), not a majority (>50%). With 2 judges, this returns 1 (50%), and with 4 judges, it returns 2 (50%). A true majority threshold should use Math.floor(numJudges / 2) + 1.
Suggested change:
	return Math.floor(numJudges / 2) + 1;
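A quick worked example of the difference for an even judge count:

```ts
const numJudges = 4;
Math.ceil(numJudges / 2);      // 2 → a 2-2 split already meets the threshold (tie, not a majority)
Math.floor(numJudges / 2) + 1; // 3 → requires a strict majority (>50%)
```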
	generateWorkflow: mockGenerateWorkflow,
});

expect(callTimes).toHaveLength(3);
P2: Test assertion doesn't verify parallelism as the test name claims. expect(callTimes).toHaveLength(3) only confirms all calls were made, not that they ran in parallel. Consider checking that all timestamps fall within a small window (e.g., all within 10ms of the first call) to verify true parallel execution.
const msg = { type: 'ai', content: 'hello' };
const input: KVMap = {
	cachedTemplates: [{ templateId: 't1', name: 'Template' }],
P2: Test doesn't verify filtering behavior as claimed. The cachedTemplates input only contains templateId and name - the exact properties that summarizeCachedTemplates preserves. Consider adding extra properties (e.g., workflow: { nodes: [] }) to the template object to verify they are actually filtered out.
Suggested change:
	cachedTemplates: [{ templateId: 't1', name: 'Template', workflow: { nodes: [] }, fullDefinition: 'large data' }],
✅ Addressed in 3ed9be4
	await expect(runV2Evaluation()).rejects.toThrow('process.exit(1)');
});

it('should exit with 0 when pass rate is exactly 70%', async () => {
P3: This test is redundant with 'should exit with 0 when pass rate >= 70%' above - both use passed: 7, totalExamples: 10 which is exactly 70%. Consider removing this duplicate test or changing the values to truly test the boundary (e.g., passed: 70, totalExamples: 100 to make the "exactly 70%" intent clearer).
- Make artifact output folder names concurrency-safe (index + deterministic short id) - Enable --output-dir artifact writing in LangSmith mode - Keep summary results stable by sorting by example index
burivuhster
left a comment
Nice refactor! And kudos for the test coverage.
Couple of suggestions/comments
// Build context - include generateWorkflow for multi-gen pairwise
const isMultiGen = args.suite === 'pairwise' && args.numGenerations > 1;
const llmCallLimiter = pLimit(args.concurrency);

const baseConfig = {
	generateWorkflow,
	evaluators,
	lifecycle,
	logger,
	outputDir: args.outputDir,
	timeoutMs: args.timeoutMs,
	context: isMultiGen ? { generateWorkflow, llmCallLimiter } : { llmCallLimiter },
};
Is there a specific reason we need to pass generateWorkflow twice in the config?
mode: 'langsmith',
dataset: args.datasetName ?? getDefaultDatasetName(args.suite),
langsmithClient: (() => {
	if (!env.lsClient) {
Would it make sense to move this check to environment init logic?
donts?: string;

numJudges: number;
numGenerations: number;
Can we unify this and use --repetitions parameter everywhere instead?
export function loadTestCasesFromCsv(csvPath: string): TestCase[] {
	const resolvedPath = path.isAbsolute(csvPath) ? csvPath : path.resolve(process.cwd(), csvPath);

	if (!existsSync(resolvedPath)) {
		throw new Error(`CSV file not found at ${resolvedPath}`);
	}

	const fileContents = readFileSync(resolvedPath, 'utf8');
	const rows = parseCsv(fileContents);

	if (rows.length === 0) {
		throw new Error('The provided CSV file is empty');
	}

	let header: ParsedCsvRow | undefined;
	let dataRows = rows;

	if (isHeaderRow(rows[0])) {
		header = rows[0]!;
		dataRows = rows.slice(1);
	}

	if (dataRows.length === 0) {
		throw new Error('No prompt rows found in the provided CSV file');
	}

	const promptIndex = header ? (detectColumnIndex(header, 'prompt') ?? 0) : 0;
	const idIndex = header ? detectColumnIndex(header, 'id') : undefined;
	const nameIndex = header
		? (detectColumnIndex(header, 'name') ?? detectColumnIndex(header, 'title'))
		: undefined;
	const dosIndex = header
		? (detectColumnIndex(header, 'dos') ?? detectColumnIndex(header, 'do'))
		: undefined;
	const dontsIndex = header
		? (detectColumnIndex(header, 'donts') ?? detectColumnIndex(header, 'dont'))
		: undefined;

	const testCases = dataRows
		.map<TestCase | undefined>((row, index) => {
			const prompt = sanitizeValue(row[promptIndex]);
			if (!prompt) {
				return undefined;
			}

			const idSource = sanitizeValue(idIndex !== undefined ? row[idIndex] : undefined);
			const nameSource = sanitizeValue(nameIndex !== undefined ? row[nameIndex] : undefined);
			const dosSource = sanitizeValue(dosIndex !== undefined ? row[dosIndex] : undefined);
			const dontsSource = sanitizeValue(dontsIndex !== undefined ? row[dontsIndex] : undefined);

			return {
				id: idSource || `csv-case-${index + 1}`,
				name: nameSource || generateNameFromPrompt(prompt, index),
				...(idSource ? { id: idSource } : { id: `csv-case-${index + 1}` }),
				prompt,
				...((dosSource || dontsSource) && {
					context: {
						...(dosSource ? { dos: dosSource } : {}),
						...(dontsSource ? { donts: dontsSource } : {}),
					},
				}),
			};
		})
		.filter((testCase): testCase is TestCase => testCase !== undefined);

	if (testCases.length === 0) {
		throw new Error('No valid prompts found in the provided CSV file');
	}

	return testCases;
}
This function is a bit hard to read because of the many undefineds, ternaries, spread operators, and the different logic depending on whether we have a header.
Here's Claude's take on how it can be simplified:
export function loadTestCasesFromCsv(csvPath: string): TestCase[] {
const resolvedPath = path.isAbsolute(csvPath) ? csvPath : path.resolve(process.cwd(), csvPath);
if (!existsSync(resolvedPath)) {
throw new Error(`CSV file not found at ${resolvedPath}`);
}
const rows = parseCsv(readFileSync(resolvedPath, 'utf8'));
if (rows.length === 0) {
throw new Error('The provided CSV file is empty');
}
const header = isHeaderRow(rows[0]) ? rows[0] : undefined;
const dataRows = header ? rows.slice(1) : rows;
if (dataRows.length === 0) {
throw new Error('No prompt rows found in the provided CSV file');
}
const col = (name: string, ...aliases: string[]) => {
if (!header) return undefined;
for (const n of [name, ...aliases]) {
const idx = detectColumnIndex(header, n);
if (idx !== undefined) return idx;
}
return undefined;
};
const promptIdx = col('prompt') ?? 0;
const idIdx = col('id');
const dosIdx = col('dos', 'do');
const dontsIdx = col('donts', 'dont');
const testCases: TestCase[] = [];
for (let i = 0; i < dataRows.length; i++) {
const row = dataRows[i];
const prompt = sanitizeValue(row[promptIdx]);
if (!prompt) continue;
const testCase: TestCase = {
id: sanitizeValue(row[idIdx!]) || `csv-case-${i + 1}`,
prompt,
};
const dos = sanitizeValue(row[dosIdx!]);
const donts = sanitizeValue(row[dontsIdx!]);
if (dos || donts) {
testCase.context = {};
if (dos) testCase.context.dos = dos;
if (donts) testCase.context.donts = donts;
}
testCases.push(testCase);
}
if (testCases.length === 0) {
throw new Error('No valid prompts found in the provided CSV file');
}
return testCases;
}
Should this whole judge/evaluators directory be moved under evaluators/llm-judge?
/** Optional reference workflow for similarity-based checks */
referenceWorkflow?: SimpleWorkflow;
/** Optional reference workflows for similarity-based checks (best match wins) */
referenceWorkflows?: SimpleWorkflow[];
Could also be a single array referenceWorkflows, consumers may pass array of a single item instead of referenceWorkflow
 * Optional generator for multi-generation evaluations.
 * When present, pairwise evaluator can generate multiple workflows from the same prompt.
 */
generateWorkflow?: (prompt: string) => Promise<SimpleWorkflow>;
Not sure how much work is to move workflow generation upper in the call chain, but it would be nice to make evaluator agnostic of the multi-generation and just always process a single generation. It would require some higher-level code doing the metrics aggregation though.
	hash = Math.imul(hash, 0x01000193);
}
return hash >>> 0;
}
🤯
);
const evalDurationMs = Date.now() - evalStart;

const totalDurationMs = Date.now() - startTime;
const score = calculateExampleScore(feedback);
const status = hasErrorFeedback(feedback)
	? 'error'
	: determineStatus({ score, passThreshold });
stats.total++;
stats.scoreSum += score;
stats.durationSumMs += totalDurationMs;
if (status === 'pass') stats.passed++;
else if (status === 'fail') stats.failed++;
else stats.errors++;
const result: ExampleResult = {
Suggested change (line breaks only, separating the logical groups):

);
const evalDurationMs = Date.now() - evalStart;
const totalDurationMs = Date.now() - startTime;

const score = calculateExampleScore(feedback);
const status = hasErrorFeedback(feedback)
	? 'error'
	: determineStatus({ score, passThreshold });

stats.total++;
stats.scoreSum += score;
stats.durationSumMs += totalDurationMs;
if (status === 'pass') stats.passed++;
else if (status === 'fail') stats.failed++;
else stats.errors++;

const result: ExampleResult = {
(Just line breaks to improve readability)
│ runEvaluation(config) │
│ │
│ Config contains: │
ASCII-art schemes have rough edges 😅
mike12345567
left a comment
Really great cleanup, really tidies this up which was starting to get a little out of control!
Left a few comments I think we should consider addressing. One other thing as well: I think it would be nice to maintain some of the old functionality we had, or at least have a package.json command to achieve it. Being able to run the matrix suite with our old prompts could be useful, because they've been a benchmark so far and we have an idea of how they score. I think it would also be nice to have a command for the old default output directory and markdown report generation - the JSON is really useful, but it's also handy to have a quick report we can look through.
Really like the generation splitting up the workflow JSON and feedback JSON, that's a massive pain reduction, always found myself trying to scrape the correct workflow out of the massive results JSON!
	fb('maintainability.modularity', result.maintainability.modularity, 'detail'),

	// Overall score
	fb('overallScore', result.overallScore, 'score', result.summary),
Should this also emit the score for the bestPractices?
 * Weights should sum to approximately 1.0.
 */
export const DEFAULT_EVALUATOR_WEIGHTS: ScoreWeights = {
	'llm-judge': 0.4,
Does the similarity evals need a weighting as well?
@@ -0,0 +1,331 @@
import { mock } from 'jest-mock-extended';
Is this test suite useful? I don't think we usually do type testing; I'm wondering if it's just an over-reach by Claude!
#### 3. Workflow Evaluator (`chains/workflow-evaluator.ts`)
# Local: LLM-judge + programmatic
pnpm eval --prompt "Create a workflow that..." --verbose
Do we not have an option now to run with our standard set of eval prompts as we did before?
type FlagDef = { key: CliKey; kind: CliValueKind };

const FLAG_TO_KEY: Record<string, FlagDef> = {
Would it be possible to add a --help command to list some of this information? Tried it locally, think it could be helpful.
Signed-off-by: Oleg Ivaniv <[email protected]>
Use yargs with schema validation and help/alias handling, drop bespoke parsing test, and tidy imports. Signed-off-by: Oleg Ivaniv <[email protected]>
- remove yargs dependency and restore the manual argument parser - add focused parser tests for numeric flags, filters, and edge cases - clean up runner debug logging Signed-off-by: Oleg Ivaniv <[email protected]>
Move the lsClient check to happen right after environment setup rather than inline during config building. This fails fast when LANGSMITH_API_KEY is missing in langsmith mode.
- Add findColumn() helper to handle column aliases and header checks - Add getCell() helper to eliminate awkward ternary patterns - Replace map/filter chain with explicit for-loop - Build context conditionally with if-statements instead of nested spreads
Move all LLM judge-related files from evaluations/judge/ into evaluations/evaluators/llm-judge/ for better organization. - Move judge/evaluators/* to evaluators/llm-judge/evaluators/ - Move judge/evaluation.ts and workflow-evaluator.ts - Update all import paths Signed-off-by: Oleg Ivaniv <[email protected]>
Unify the API to only use referenceWorkflows (array) instead of having both referenceWorkflow (single) and referenceWorkflows (array). This simplifies the interface and reduces confusion. Added backwards compatibility in runner.ts to handle legacy datasets that still use the single referenceWorkflow field.
Use Node's built-in crypto module for generating deterministic short IDs instead of manual bit manipulation. Much more readable while providing the same functionality.
Separate logical groups in processExample with blank lines: - duration calculations - score and status - stats updates - result creation
Mermaid diagrams render cleanly on GitHub and avoid alignment issues with ASCII art.
Add explicit weight for similarity evaluator (0.15) and rebalance: - llm-judge: 0.35 (was 0.4) - programmatic: 0.25 (was 0.3) - pairwise: 0.25 (was 0.3) - similarity: 0.15 (new)
Move default test cases from hardcoded TypeScript array to a CSV fixture file at fixtures/default-prompts.csv. This makes it easier to modify test cases without code changes and provides a cleaner separation of data. - Add fixtures/default-prompts.csv with 10 default test prompts - Add loadDefaultTestCases() and getDefaultTestCaseIds() helpers - Update CLI to use CSV fixture instead of basicTestCases - Update README to document default prompts usage - Remove basicTestCases export from test-case-generator.ts
Auto-generate help text from flag definitions with descriptions and groups. Flags are organized by category: Input, Evaluation, Pairwise, LangSmith, Output, Feature Flags, and Advanced.
1 issue found across 41 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/README.md">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/README.md:81">
P3: The diagram is missing `'detail'` from the `kind` type. The actual `Feedback` interface (and the Feedback section later in this file) defines `kind: 'score' | 'metric' | 'detail'`.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| F1["evaluator: string"] | ||
| F2["metric: string"] | ||
| F3["score: 0-1"] | ||
| F4["kind: 'score' | 'metric'"] |
P3: The diagram is missing 'detail' from the kind type. The actual Feedback interface (and the Feedback section later in this file) defines kind: 'score' | 'metric' | 'detail'.
| F4["kind: 'score' | 'metric'"] | |
| F4["kind: 'score' | 'metric' | 'detail'"] |
Summary
This PR rewrites the AI Workflow Builder evaluations into a single v2 harness with:
- A unified runner (local + langsmith backends) that owns dataset setup, concurrency, timeouts, scoring, and artifact writing.
- Evaluators as composable plugins that emit a shared Feedback[] format.

Why
The previous evals were hard to extend safely because responsibilities were duplicated across multiple runners:
What changed (high level)
1) Unified runner / harness
The core entrypoint is evaluations/harness/runner.ts (runEvaluation(config)), which now owns:
- Concurrency via a shared llmCallLimiter (p-limit), used for generation and evaluator LLM calls.
- Per-example timeouts via withTimeout() (best-effort; does not cancel underlying requests).
- Scoring based on Feedback.kind.
- Artifact writing (--output-dir) in local mode.

Evaluators are now "plugins" that don't care about backends; they just evaluate a workflow + context and return feedback.
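A usage sketch of the entrypoint; the config fields mirror the CLI snippet quoted earlier in the review thread, the values are illustrative, and generateWorkflow/evaluators/lifecycle/logger are assumed to be built beforehand:

```ts
import pLimit from 'p-limit';

const summary = await runEvaluation({
	mode: 'local',
	generateWorkflow,        // (prompt) => Promise<SimpleWorkflow>
	evaluators,              // created via the evaluator factories
	lifecycle,               // console lifecycle for progress reporting
	logger,
	outputDir: './eval-results',
	timeoutMs: 120_000,
	context: { llmCallLimiter: pLimit(4) },
});
```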
2) Shared types + feedback contract
All evaluators emit Feedback[] (evaluations/harness/harness-types.ts):
- evaluator: stable evaluator id (e.g. llm-judge, pairwise, programmatic)
- metric: metric key (supports dot-path sub-metrics)
- score: normalized 0..1
- kind: score | metric | detail (used by the scorer; detail should not affect overall scoring)
- comment: optional explanation / violations

This contract is what keeps the harness readable: the runner doesn't need evaluator-specific coercions or ad-hoc post-processing.
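Spelled out as an interface (the authoritative definition lives in harness-types.ts):

```ts
interface Feedback {
	/** Stable evaluator id, e.g. 'llm-judge', 'pairwise', 'programmatic' */
	evaluator: string;
	/** Metric key; dot-paths allowed for sub-metrics */
	metric: string;
	/** Normalized 0..1 */
	score: number;
	/** 'detail' entries are informational and excluded from overall scoring */
	kind: 'score' | 'metric' | 'detail';
	/** Optional explanation / violations */
	comment?: string;
}
```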
3) LangSmith integration fixes (reliable traces + smaller payloads)
Key decisions in the LangSmith runner:
- Do not wrap the evaluate() target with traceable(); the LangSmith SDK wraps it and attaches critical options (linking to examples, run lifecycle, client/project settings).
- Do wrap inner operations with traceable() for child traces and explicitly pass client: lsClient so child traces attach to the same run tree.

To reduce payload size (and avoid 403 multipart errors on large runs), we keep "minimal tracing" enabled by default and filter heavy state fields via hideInputs/hideOutputs (evaluations/langsmith/trace-filters.ts). We explicitly keep messages untrimmed to avoid breaking downstream expectations.
LangSmith metric key mapping is centralized (
evaluations/harness/feedback.ts):overallScore,maintainability.workflowOrganization) to match historical dashboards.programmatic.trigger) so they don’t collide with LLM-judge root keys.pairwise_primary,pairwise_total_violations, etc.) as raw counts/ratios as before; additional judge/gen details are namespaced (e.g.pairwise.judge1).5) Directory structure (separation by responsibility)
The evals folder is reorganized so ownership is obvious:
evaluations/cli/: CLI entry + arg parsing + CSV loaderevaluations/harness/: runner, lifecycle logging, scoring, artifacts, helper utilitiesevaluations/evaluators/: evaluator factories (llm-judge, pairwise, programmatic, similarity)evaluations/judge/: LLM-judge internals (schema + category evaluators + workflow evaluator)evaluations/langsmith/: LangSmith helpers (types + trace filters)evaluations/support/: environment setup (LLM + node types + LangSmith client), node loading, reports, test-case generationevaluations/programmatic/python/: unchanged layout (Python similarity tooling)Legacy/categorization evaluation artifacts were removed as part of cleanup.
Behavior changes / notes
- Concurrency is controlled via --concurrency (a global limiter); nested LLM calls are routed through the same limiter to prevent multiplicative parallelism.
- Artifacts are written under --output-dir (one folder per example + summary.json).
- LangSmith mode does not use the local prompt flags (--prompt, --prompts-csv, --test-case) and requires --dataset.

Tests
This PR adds/expands Jest coverage for:
- CSV test-case loading (header aliases, dos/donts)
- concurrency and timeout handling via p-limit

Run from packages/@n8n/ai-workflow-builder.ee:

How to verify (manual)
From packages/@n8n/ai-workflow-builder.ee:

Related Linear tickets, Github issues, and Community forum posts
Review / Merge checklist
release/backport (if the PR is an urgent fix that needs to be backported)