
Conversation


@OlegIvaniv OlegIvaniv commented Jan 7, 2026

Summary

This PR rewrites the AI Workflow Builder evaluations into a single v2 harness with:

  • One runner (local + langsmith backends) that owns dataset setup, concurrency, timeouts, scoring, and artifact writing.
  • Multiple evaluators (LLM-as-judge, pairwise, programmatic, similarity) that are backend-agnostic and return a shared Feedback[] format.
  • A single CLI entrypoint that only orchestrates (parse args → build config → run).
  • A comprehensive Jest test suite covering the tricky behavior (scoring, CLI parsing, concurrency limiting, LangSmith trace behavior, CSV loading).

Why

The previous evals were hard to extend safely because responsibilities were duplicated across multiple runners:

  • Each runner owned its own CLI parsing and argument transformation, often with subtly different semantics.
  • Dataset setup was scattered and inconsistent (local vs LangSmith, different formats per runner).
  • Concurrency was not a single “knob”; nested parallelism (judge panels, multi-generation) could multiply total LLM calls and hit provider/LangSmith rate limits.
  • LangSmith integration was fragile: we were seeing missing metrics/traces and occasional 403 payload issues from large trace uploads.
  • There was little/no automated test coverage, so refactors were risky and regressions were hard to catch.

What changed (high level)

1) Unified runner / harness

The core entrypoint is evaluations/harness/runner.ts (runEvaluation(config)), which now owns:

  • Running a dataset in local mode (CLI prompt/CSV/test-case) and LangSmith mode (dataset name or preloaded examples).
  • Global concurrency control via a single llmCallLimiter (p-limit), used for generation and evaluator LLM calls.
  • Per-operation timeouts via withTimeout() (best-effort; does not cancel underlying requests).
  • Stable scoring and pass/fail status derived from Feedback.kind.
  • Optional artifact output (--output-dir) in local mode.

Evaluators are now “plugins” that don’t care about backends; they just evaluate a workflow + context and return feedback.
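
For illustration, the concurrency and timeout plumbing looks roughly like this (a minimal sketch, not the actual runner.ts code; createRunner and the parameter names are placeholders):

import pLimit from 'p-limit';

// Illustrative type; the real config lives in harness-types.ts.
type GenerateWorkflow = (prompt: string) => Promise<unknown>;

function createRunner(generateWorkflow: GenerateWorkflow, concurrency: number, timeoutMs: number) {
  // Single global knob: generation and evaluator LLM calls share this limiter,
  // so nested parallelism (judge panels, multi-generation) cannot multiply calls.
  const llmCallLimiter = pLimit(concurrency);

  // Best-effort timeout: rejects after `ms`, but does not cancel the underlying request.
  async function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
    let timer: NodeJS.Timeout | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
    });
    try {
      return await Promise.race([promise, timeout]);
    } finally {
      clearTimeout(timer);
    }
  }

  // Timeouts are applied inside the limiter slot, not around the queue wait.
  return async (prompt: string) =>
    await llmCallLimiter(async () => await withTimeout(generateWorkflow(prompt), timeoutMs, 'generation'));
}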

2) Shared types + feedback contract

All evaluators emit Feedback[] (evaluations/harness/harness-types.ts):

  • evaluator: stable evaluator id (e.g. llm-judge, pairwise, programmatic)
  • metric: metric key (supports dot-path sub-metrics)
  • score: normalized 0..1
  • kind: score | metric | detail (used by the scorer; detail should not affect overall scoring)
  • comment: optional explanation / violations

This contract is what keeps the harness readable: the runner doesn’t need evaluator-specific coercions or ad-hoc post-processing.
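
In shape, the contract is roughly the following (a sketch of harness-types.ts; the actual file may carry additional fields):

// Sketch of the shared feedback contract (see evaluations/harness/harness-types.ts).
export interface Feedback {
  /** Stable evaluator id, e.g. 'llm-judge' | 'pairwise' | 'programmatic' */
  evaluator: string;
  /** Metric key; dot-paths express sub-metrics, e.g. 'maintainability.workflowOrganization' */
  metric: string;
  /** Normalized 0..1 */
  score: number;
  /** 'detail' entries are informational and must not affect overall scoring */
  kind: 'score' | 'metric' | 'detail';
  /** Optional explanation / violations */
  comment?: string;
}

// Evaluators are backend-agnostic: workflow + context in, Feedback[] out.
export interface Evaluator {
  name: string;
  evaluate(workflow: unknown, context: Record<string, unknown>): Promise<Feedback[]>;
}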

3) LangSmith integration fixes (reliable traces + smaller payloads)

Key decisions in the LangSmith runner:

  • Do not wrap the top-level evaluate() target with traceable(); the LangSmith SDK wraps it and attaches critical options (linking to examples, run lifecycle, client/project settings).
  • Do wrap inner operations with traceable() for child traces and explicitly pass client: lsClient so child traces attach to the same run tree.
  • Flush pending trace batches before returning.

To reduce payload size (and avoid 403 multipart errors on large runs), we keep “minimal tracing” enabled by default and filter heavy state fields via hideInputs/hideOutputs (evaluations/langsmith/trace-filters.ts). We explicitly keep messages untrimmed to avoid breaking downstream expectations.
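
Condensed, the pattern looks roughly like this (a sketch using the public langsmith SDK API; the hideInputs/hideOutputs bodies are placeholders for the real filters in trace-filters.ts):

import { Client } from 'langsmith';
import { evaluate } from 'langsmith/evaluation';
import { traceable } from 'langsmith/traceable';

declare const generateWorkflow: (prompt: string) => Promise<unknown>; // provided by the harness

const lsClient = new Client({
  // The real filters summarize heavy fields (cachedTemplates, parsedNodeTypes)
  // and deliberately leave `messages` untrimmed; identity functions shown here.
  hideInputs: (inputs) => inputs,
  hideOutputs: (outputs) => outputs,
});

// Inner operations ARE wrapped, sharing the same client so child traces attach
// to the run tree created by evaluate().
const tracedGenerate = traceable(async (prompt: string) => await generateWorkflow(prompt), {
  name: 'generateWorkflow',
  client: lsClient,
});

// The top-level target is NOT wrapped: evaluate() wraps it itself and attaches
// example linking, run lifecycle, and client/project settings.
await evaluate(async (inputs) => await tracedGenerate(String(inputs.prompt)), {
  data: 'workflow-builder-canvas-prompts',
  client: lsClient,
});

// Flush pending trace batches before returning.
await lsClient.awaitPendingTraceBatches();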

4) Metric key compatibility for comparing old vs new runs

LangSmith metric key mapping is centralized (evaluations/harness/feedback.ts):

  • LLM-judge metrics are unprefixed (e.g. overallScore, maintainability.workflowOrganization) to match historical dashboards.
  • Programmatic metrics remain prefixed (e.g. programmatic.trigger) so they don’t collide with LLM-judge root keys.
  • Pairwise preserves v1 metric keys (e.g. pairwise_primary, pairwise_total_violations, etc.) as raw counts/ratios as before; additional judge/gen details are namespaced (e.g. pairwise.judge1).
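
Conceptually, the mapping is small (illustrative sketch; the real logic lives in evaluations/harness/feedback.ts):

// Illustrative sketch of the LangSmith metric key mapping described above.
function toLangsmithKey(evaluator: string, metric: string): string {
  switch (evaluator) {
    case 'llm-judge':
      // Unprefixed to match historical dashboards,
      // e.g. 'overallScore', 'maintainability.workflowOrganization'.
      return metric;
    case 'programmatic':
      // Prefixed so rule-based metrics never collide with LLM-judge root keys.
      return metric.startsWith('programmatic.') ? metric : `programmatic.${metric}`;
    case 'pairwise':
      // v1 keys like 'pairwise_primary' pass through; extra judge/gen details
      // arrive already namespaced, e.g. 'pairwise.judge1'.
      return metric;
    default:
      return `${evaluator}.${metric}`;
  }
}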

5) Directory structure (separation by responsibility)

The evals folder is reorganized so ownership is obvious:

  • evaluations/cli/: CLI entry + arg parsing + CSV loader
  • evaluations/harness/: runner, lifecycle logging, scoring, artifacts, helper utilities
  • evaluations/evaluators/: evaluator factories (llm-judge, pairwise, programmatic, similarity)
  • evaluations/judge/: LLM-judge internals (schema + category evaluators + workflow evaluator)
  • evaluations/langsmith/: LangSmith helpers (types + trace filters)
  • evaluations/support/: environment setup (LLM + node types + LangSmith client), node loading, reports, test-case generation
  • evaluations/programmatic/python/: unchanged layout (Python similarity tooling)

Legacy/categorization evaluation artifacts were removed as part of cleanup.

Behavior changes / notes

  • Concurrency is now intended to be controlled primarily via --concurrency (global limiter); nested LLM calls are routed through the same limiter to prevent multiplicative parallelism.
  • Local mode can write artifacts via --output-dir (one folder per example + summary.json).
  • LangSmith mode rejects local-only prompt sources (--prompt, --prompts-csv, --test-case) and requires --dataset.

Tests

This PR adds/expands Jest coverage for:

  • CLI parsing edge cases and validation
  • CSV prompt loading (including dos/donts)
  • Local + LangSmith runner behavior (including dataset context extraction)
  • Scoring invariance when new detail metrics are added
  • Concurrency limiting of judge panels / multi-generation via p-limit
  • Trace filter behavior (keeps messages untrimmed while filtering heavy fields)
  • Metric key mapping contracts (LangSmith comparability)

Run from packages/@n8n/ai-workflow-builder.ee:

pnpm lint
pnpm typecheck
pnpm test:eval

How to verify (manual)

From packages/@n8n/ai-workflow-builder.ee:

# Local smoke test
pnpm eval --prompt "Create a workflow that..." --verbose --output-dir ./.data/out/llm-judge-local

# Pairwise local
pnpm eval:pairwise --prompt "Create a workflow that..." --dos "Must use Notion" --donts "No HTTP Request node" --verbose --output-dir ./.data/out/pairwise-local

# LangSmith (requires LANGSMITH_API_KEY)
pnpm eval:langsmith --dataset "workflow-builder-canvas-prompts" --name "v2-smoke" --max-examples 5 --concurrency 5 --verbose

Related Linear tickets, Github issues, and Community forum posts

Review / Merge checklist

  • PR title and summary are descriptive. (conventions)
  • Docs updated or follow-up ticket created.
  • Tests included.
  • PR Labeled with release/backport (if the PR is an urgent fix that needs to be backported)

Add trace filtering to avoid 403 errors from oversized LangSmith payloads
during concurrent evaluations. Large fields like cachedTemplates and
parsedNodeTypes are summarized while preserving essential debugging info.

Changes:
- Add trace-filters.ts with hideInputs/hideOutputs filtering logic
- Add resetFilteringStats() to ensure accurate per-run statistics
- Add TRACE_BATCH_SIZE_LIMIT and TRACE_BATCH_CONCURRENCY constants
- Pass custom LangSmith client to evaluate() calls

Filtering is enabled by default. Set LANGSMITH_MINIMAL_TRACING=false
to disable and get full traces.
Replace global state with closure-scoped state per client instance.
This avoids issues with parallel evaluations corrupting shared counters.
Pass EvalLogger through setupTestEnvironment to createTraceFilters
for consistent logging across the evaluation system.
Extract helper functions to reduce cyclomatic complexity:
- summarizeContextField() for context field placeholder strings
- filterWorkflowContext() for workflowContext object filtering
- summarizeLargeWorkflow() for conditional workflow summarization
- trackInputPassthrough() for stat tracking on unchanged inputs
- Add verbose parameter to runLangsmithEvaluation()
- Use EvalLogger for consistent logging output
- Pass logger to setupTestEnvironment() for trace filter logging
- Document CLI options for eval:langsmith in README
Add detailed verbose output showing:
- Judge details: individual verdicts with brief justifications
- Timing breakdown: generation time vs judge time with averages
- Workflow summary: compact node type listing (e.g. "5 nodes (Webhook, IF, HTTP Request x2)")

Works for both local pairwise (--prompt) and LangSmith pairwise modes.
Add --verbose flag to local CLI evaluation (pnpm eval) showing:
- Per-test results as they complete (PASS/WARN/ERROR with score)
- Generation timing
- Workflow summary (node types)
- Key category scores (functionality, connections, config)
- Critical issues if any

In verbose mode, progress bar is replaced with real-time test output.
- Add per-example result logging with prompt, scores, and pass/fail status
- Add summary statistics (pass rate, average scores) at end of evaluation
- Add dataset stats and model info in verbose mode
- Enhance trace filtering: add messages array summarization
- Reduce batch size limit to 2MB and add batchSizeLimit option
- Add input field truncation for large LangChain model inputs
Upgrade to get:
- batchSizeLimit option for controlling runs per batch
- Better multipart upload handling (may fix 403 errors)
- Memory leak fixes and async improvements
- omitTracedRuntimeInfo option for smaller payloads
- Add unified argument parser for all CLI flags
- Add ordered progress reporter for real-time verbose logging
- Add abstract runner base class with template method pattern
- Create LLM-judge runner that moves evaluation INTO target function
  (fixes "Run not created by target function" error in LangSmith 0.4.x)
- Create pairwise runner using new architecture
- Update index.ts to use new unified runners

Key fix: LangSmith 0.4.x requires all LLM calls to happen inside the
traceable target function. The old evaluator was calling LLM chains
outside the traceable context, causing 403 errors.
- Add core interfaces (Evaluator, Feedback, RunConfig, Lifecycle)
- Add runner supporting both local and LangSmith modes
- Add console lifecycle for progress reporting
- Add comprehensive test coverage (55 tests passing)

Key design decisions:
- Factory pattern for evaluator creation
- Pre-computed feedback pattern for LangSmith compatibility
- Parallel evaluators, sequential examples
- Skip and continue on errors
- LLM-judge evaluator wraps existing evaluateWorkflow chain
- Programmatic evaluator wraps rule-based checks
- Add index files for module exports
- All 60 tests passing

Factory pattern enables:
- Easy composition of evaluators
- Parallel execution
- Centralized error handling via runner
- Create runV2Evaluation() function that ties together:
  - Environment setup
  - Workflow generator
  - Evaluator factories
  - Console lifecycle
  - Local and LangSmith modes
- Demonstrates full v2 harness usage
- 60 tests passing

Signed-off-by: Oleg Ivaniv <[email protected]>
- Add pairwise evaluator factory wrapping runJudgePanel()
- Add 9 tests for pairwise evaluator (TDD)
- Fix prompt not being passed to LLM-judge evaluator context
- Fix LangSmith dataset format (support messages[] array)
- Improve verbose output with critical metrics and violations
- Remove truncation from violation output
- Add README.md with mental model and documentation
- Add CLI tests (19 tests) covering loadTestCases, mode selection,
  config building, exit codes, and workflow generator setup
- Add programmatic evaluator tests (9 tests) covering all feedback
  categories and violation formatting
- Expand lifecycle tests (28 additional tests) for verbose output,
  critical metrics display, violations, and merge functions

Coverage improvement:
- cli.ts: 0% → 76%
- programmatic.ts: 0% → 100%
- lifecycle.ts: 66% → 100%
- Overall v2/: 63% → 89%
…e filtering to v2

Artifact saving:
- Add createArtifactSaver() for persisting evaluation results to disk
- Save prompt.txt, workflow.json, feedback.json per example
- Save summary.json with per-evaluator statistics
- 13 tests for output module

Similarity evaluator:
- Add createSimilarityEvaluator() wrapping Python graph edit distance
- Support single and multiple reference workflows
- Support preset configurations (strict/standard/lenient)
- 11 tests for similarity evaluator

Trace filtering:
- Integrate trace filtering into LangSmith mode (enabled by default)
- Add enableTraceFiltering option to LangsmithOptions
- Re-export createTraceFilters and isMinimalTracingEnabled

Coverage: 90% statements, 80% branches, 140 tests passing
Add utilities for analyzing LLM token cache performance:
- calculateCacheStats() - compute stats from token usage metadata
- aggregateCacheStats() - aggregate multiple stats with correct hit rate
- formatCacheStats() - format for display with locale strings

14 tests for cache analyzer module.
Add ability to generate multiple workflows per prompt and aggregate
results for variance reduction in pairwise evaluation.

New features:
- createPairwiseEvaluator({ numGenerations: N }) - generate N workflows
- aggregateGenerations() - calculate generation correctness
- getMajorityThreshold() - majority voting utility
- CLI support via --generations flag

Feedback keys for multi-gen:
- pairwise.generationCorrectness (passing/total)
- pairwise.aggregatedDiagnostic (avg score)
- pairwise.genN.majorityPass/diagnosticScore (per-gen details)

19 new tests for multi-gen functionality.
…se generator to v2

- Add score-calculator.ts with weighted scoring and evaluator grouping
- Add report-generator.ts for markdown report generation
- Add test-case-generator.ts with LLM-based generation and basicTestCases
- Fix --max-examples flag for LangSmith mode in runner-base.ts
- Export all new utilities from v2/index.ts

62 new tests added (235 total for v2)
…aceable()

The LangSmith SDK's evaluate() function checks if the target is already
wrapped with traceable(). When it is, the SDK skips applying critical
defaultOptions (on_end, reference_example_id, client), causing issues like:
- Target function executing multiple times per example
- Missing traces in LangSmith dashboard
- Client mismatch between evaluate() and inner traceable() calls

The fix:
- Do NOT wrap target function with traceable() - let evaluate() handle it
- DO wrap inner operations (like generateWorkflow) with traceable() for
  child trace visibility

Both v2 runner and pairwise generator now follow this harmonized pattern.
- Add trace flushing with awaitPendingTraceBatches() to ensure all traces
  are sent before process exits
- Fix exit code in v2 CLI: LangSmith mode now returns 0 on success since
  results are in the dashboard, not the placeholder summary
- Use limit parameter on listExamples() instead of fetching all and slicing
  to avoid double target execution
- Add --dataset CLI flag for specifying LangSmith dataset name
- Add maxExamples option to LangsmithOptions type
- Fix message type detection in trace-filters to avoid errors with
  unusual message objects
Add comprehensive documentation about LangSmith SDK interactions:
- Root cause of traceable() + evaluate() conflicts
- Correct pattern: don't wrap target, do wrap inner operations
- Environment variables required for tracing
- Trace flushing importance
- Payload size filtering tips
- numRepetitions behavior
- AsyncLocalStorage context tracking
- Client consistency requirements
- Debugging tips and SDK source location

Also updates CLI usage examples and file structure documentation.
Signed-off-by: Oleg Ivaniv <[email protected]>
… mode

- Add logger to RunConfig and pass it from CLI to runner
- Replace console.log calls in runner.ts with proper logger methods:
  - Per-workflow progress → logger.verbose() (only shown with --verbose)
  - Important status messages → logger.info() (always shown)
- Remove [v2] prefix from log messages (no longer needed post-migration)
- Remove redundant console.warn calls from programmatic-evaluation.ts
  (error info is already captured in the returned violation result)

This enables cleaner output by default while preserving detailed logging
via the --verbose flag.
- runner.ts: Handle unknown error type in template literal with proper
  type narrowing (instanceof Error check)
- trace-filters.ts: Replace unsafe type casts with proper type guards
  (hasGetTypeMethod, hasTypeProperty, getTypeName) for safe property access
- test-case-generator.ts: Add Zod-based validation with parseTestCasesOutput()
  to safely type LLM structured output
- test-case-generator.test.ts: Create helper functions with type guards
  to safely extract data from jest mock calls
- output.test.ts: Use jsonParse<T> from n8n-workflow with proper type
  interfaces for type-safe JSON parsing in tests

All fixes use runtime type guards instead of type casting to satisfy
strict TypeScript lint rules.
- Require langsmithClient in LangSmith RunConfig and pass it from CLI setupTestEnvironment
- Remove runner-side setupTestEnvironment call to avoid re-initialization/config drift
- Ensure nested traceable() uses the same client instance via TraceableConfig.client
- Update unit tests for new RunConfig contract
- Prevent NaN in verbose logs, artifacts, and reports by averaging finite scores only
- Clarify weighting layers (LLM category vs cross-evaluator) with explicit names
- Tighten LangSmith reference workflow parsing and ignore invalid refs
- Apply timeouts inside p-limit work; document best-effort cancellation
- Add focused tests for NaN handling, ref parsing, and limiter-slot timeouts
- Remove cache/usage reporting from eval CLI and LangSmith evaluator
- Delete unused cache analyzer module and tests
- Keep full message arrays in LangSmith trace filtering (no summarization)
- Cover header order, do/dont aliases, BOM, headerless CSV, and empty-row handling
- Add negative-path assertions for missing/empty files and no valid prompts

Signed-off-by: Oleg Ivaniv <[email protected]>
Move eval harness into clear submodules (cli/harness/judge/langsmith/support) and update scripts/docs/tests accordingly.

- Keep python evals under evaluations/programmatic/python unchanged
- Remove categorization-eval artifacts (types + prompts example)
- Fix load-nodes path resolution and update downstream imports
Refresh evaluations README for the v2 harness.

- Add quick start, prerequisites, CSV format, and component map
- Clarify Feedback.kind semantics and LangSmith metric key mapping
- Remove SDK-internals/debug-only LangSmith gotchas (keep only the critical traceable rule)
@OlegIvaniv OlegIvaniv marked this pull request as ready for review January 8, 2026 08:57

@cubic-dev-ai cubic-dev-ai bot left a comment

13 issues found across 109 files

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts">

<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts:57">
P2: Rule violated: **Prefer Typeguards over Type casting**

Use a type guard instead of `as` for type narrowing. The pattern `typeof x === 'object' && x !== null` can be encapsulated in a reusable type guard like `isRecord(value): value is Record<string, unknown>`.</violation>

<violation number="2" location="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts:75">
P2: Rule violated: **Prefer Typeguards over Type casting**

Use a type guard instead of `as` for type narrowing. This is the same pattern as `summarizeWorkflow` - consider creating a shared `isRecord()` type guard.</violation>

<violation number="3" location="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts:141">
P2: Rule violated: **Prefer Typeguards over Type casting**

Use a type guard instead of `as` for type narrowing. Consider creating a type guard function that validates the structure (e.g., `hasNodes(value): value is { nodes?: unknown[] }`).</violation>

<violation number="4" location="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts:202">
P2: Rule violated: **Prefer Typeguards over Type casting**

Use a type guard instead of `as` for type narrowing. The `typeof === 'object'` check validates runtime type but a type guard would properly narrow the TypeScript type.</violation>
</file>

<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/lifecycle.ts">

<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/lifecycle.ts:306">
P2: Inverted verbose condition: `onEvaluatorError` only logs in non-verbose mode, which is the opposite of other hooks (`onExampleStart`, `onExampleComplete`). Evaluator errors will be silently suppressed in verbose mode.</violation>

<violation number="2" location="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/lifecycle.ts:392">
P2: Rule violated: **Prefer Typeguards over Type casting**

Use a type predicate instead of `as` for type narrowing after filtering. The `filter(Boolean)` pattern doesn't automatically narrow types in TypeScript - use a type guard function instead.</violation>
</file>

<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/evaluators/programmatic/index.ts">

<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/evaluators/programmatic/index.ts:72">
P2: Missing feedback entries for `nodes` and `credentials` evaluation results. The underlying `programmaticEvaluation` computes and returns both `result.nodes` and `result.credentials` with `.score` and `.violations` properties (same structure as other categories), but these are not included in the feedback array. This means these metric scores won't be visible on dashboards for debugging/analysis, even though they contribute to the overall score.</violation>
</file>

<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/multi-gen.test.ts">

<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/multi-gen.test.ts:51">
P2: Test description doesn't match assertion: the test says "should return 2 for 2 judges" but the assertion expects 1, not 2.</violation>

<violation number="2" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/multi-gen.test.ts:55">
P2: Test description doesn't match assertion: the test says "should return 3 for 4 judges" but the assertion expects 2, not 3.</violation>
</file>

<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/multi-gen.ts">

<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/multi-gen.ts:56">
P2: For even numbers of judges, `Math.ceil(numJudges / 2)` returns the threshold for a tie (50%), not a majority (>50%). With 2 judges, this returns 1 (50%), and with 4 judges, it returns 2 (50%). A true majority threshold should use `Math.floor(numJudges / 2) + 1`.</violation>
</file>

<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/cli.test.ts">

<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/cli.test.ts:401">
P3: This test is redundant with 'should exit with 0 when pass rate >= 70%' above - both use `passed: 7, totalExamples: 10` which is exactly 70%. Consider removing this duplicate test or changing the values to truly test the boundary (e.g., `passed: 70, totalExamples: 100` to make the "exactly 70%" intent clearer).</violation>
</file>

<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/evaluators/pairwise.test.ts">

<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/evaluators/pairwise.test.ts:497">
P2: Test assertion doesn't verify parallelism as the test name claims. `expect(callTimes).toHaveLength(3)` only confirms all calls were made, not that they ran in parallel. Consider checking that all timestamps fall within a small window (e.g., all within 10ms of the first call) to verify true parallel execution.</violation>
</file>

<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/trace-filters.test.ts">

<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/trace-filters.test.ts:27">
P2: Test doesn't verify filtering behavior as claimed. The `cachedTemplates` input only contains `templateId` and `name` - the exact properties that `summarizeCachedTemplates` preserves. Consider adding extra properties (e.g., `workflow: { nodes: [] }`) to the template object to verify they are actually filtered out.</violation>
</file>


if (!workflow || typeof workflow !== 'object') {
return workflow;
}
const wf = workflow as { nodes?: unknown[] };

@cubic-dev-ai cubic-dev-ai bot Jan 8, 2026

P2: Rule violated: Prefer Typeguards over Type casting

Use a type guard instead of as for type narrowing. Consider creating a type guard function that validates the structure (e.g., hasNodes(value): value is { nodes?: unknown[] }).

Location: packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts, line 141

<file context>
@@ -0,0 +1,250 @@
+	if (!workflow || typeof workflow !== 'object') {
+		return workflow;
+	}
+	const wf = workflow as { nodes?: unknown[] };
+	if (wf.nodes && wf.nodes.length > WORKFLOW_SUMMARY_THRESHOLD) {
+		return summarizeWorkflow(workflow);
</file context>

✅ Addressed in 3ed9be4
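
For reference, a reusable guard along the lines the review suggests could look like this (illustrative sketch; not necessarily the exact code in 3ed9be4):

// Type guards instead of `as` casts (hypothetical helpers for illustration).
export function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

export function hasNodes(value: unknown): value is { nodes?: unknown[] } {
  return isRecord(value) && (!('nodes' in value) || Array.isArray(value.nodes));
}

// Usage replaces `const wf = workflow as { nodes?: unknown[] };` with
// `if (hasNodes(workflow) && (workflow.nodes?.length ?? 0) > WORKFLOW_SUMMARY_THRESHOLD) ...`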

function summarizeCachedTemplates(templates: unknown[]): Array<Record<string, unknown>> {
return templates.map((t) => {
if (!t || typeof t !== 'object') return { unknown: true };
const template = t as Record<string, unknown>;

@cubic-dev-ai cubic-dev-ai bot Jan 8, 2026

P2: Rule violated: Prefer Typeguards over Type casting

Use a type guard instead of as for type narrowing. This is the same pattern as summarizeWorkflow - consider creating a shared isRecord() type guard.

Location: packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts, line 75

<file context>
@@ -0,0 +1,250 @@
+function summarizeCachedTemplates(templates: unknown[]): Array<Record<string, unknown>> {
+	return templates.map((t) => {
+		if (!t || typeof t !== 'object') return { unknown: true };
+		const template = t as Record<string, unknown>;
+		return {
+			templateId: template.templateId,
</file context>

✅ Addressed in 3ed9be4

return { unknown: true };
}

const wf = workflow as Record<string, unknown>;

@cubic-dev-ai cubic-dev-ai bot Jan 8, 2026

P2: Rule violated: Prefer Typeguards over Type casting

Use a type guard instead of as for type narrowing. The pattern typeof x === 'object' && x !== null can be encapsulated in a reusable type guard like isRecord(value): value is Record<string, unknown>.

Location: packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts, line 57

<file context>
@@ -0,0 +1,250 @@
+		return { unknown: true };
+	}
+
+	const wf = workflow as Record<string, unknown>;
+	const nodes = wf.nodes as Array<{ name?: string }> | undefined;
+	const connections = wf.connections as Record<string, unknown> | undefined;
</file context>

✅ Addressed in 3ed9be4

// Handle workflowContext if present
if (filtered.workflowContext && typeof filtered.workflowContext === 'object') {
filtered.workflowContext = filterWorkflowContext(
filtered.workflowContext as Record<string, unknown>,

@cubic-dev-ai cubic-dev-ai bot Jan 8, 2026

P2: Rule violated: Prefer Typeguards over Type casting

Use a type guard instead of as for type narrowing. The typeof === 'object' check validates runtime type but a type guard would properly narrow the TypeScript type.

Location: packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts, line 202

<file context>
@@ -0,0 +1,250 @@
+		// Handle workflowContext if present
+		if (filtered.workflowContext && typeof filtered.workflowContext === 'object') {
+			filtered.workflowContext = filterWorkflowContext(
+				filtered.workflowContext as Record<string, unknown>,
+			);
+		}
</file context>

✅ Addressed in 3ed9be4

export function mergeLifecycles(
...lifecycles: Array<Partial<EvaluationLifecycle> | undefined>
): EvaluationLifecycle {
const validLifecycles = lifecycles.filter(Boolean) as Array<Partial<EvaluationLifecycle>>;

@cubic-dev-ai cubic-dev-ai bot Jan 8, 2026

P2: Rule violated: Prefer Typeguards over Type casting

Use a type predicate instead of as for type narrowing after filtering. The filter(Boolean) pattern doesn't automatically narrow types in TypeScript - use a type guard function instead.

Location: packages/@n8n/ai-workflow-builder.ee/evaluations/harness/lifecycle.ts, line 392

<file context>
@@ -0,0 +1,437 @@
+export function mergeLifecycles(
+	...lifecycles: Array<Partial<EvaluationLifecycle> | undefined>
+): EvaluationLifecycle {
+	const validLifecycles = lifecycles.filter(Boolean) as Array<Partial<EvaluationLifecycle>>;
+
+	return {
</file context>
Suggested change
const validLifecycles = lifecycles.filter(Boolean) as Array<Partial<EvaluationLifecycle>>;
const validLifecycles = lifecycles.filter(
(lc): lc is Partial<EvaluationLifecycle> => Boolean(lc),
);

✅ Addressed in 3ed9be4

expect(getMajorityThreshold(5)).toBe(3);
});

it('should return 2 for 2 judges', () => {

@cubic-dev-ai cubic-dev-ai bot Jan 8, 2026

P2: Test description doesn't match assertion: the test says "should return 2 for 2 judges" but the assertion expects 1, not 2.

Location: packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/multi-gen.test.ts, line 51

<file context>
@@ -0,0 +1,159 @@
+			expect(getMajorityThreshold(5)).toBe(3);
+		});
+
+		it('should return 2 for 2 judges', () => {
+			expect(getMajorityThreshold(2)).toBe(1);
+		});
</file context>
Suggested change
it('should return 2 for 2 judges', () => {
it('should return 1 for 2 judges', () => {

✅ Addressed in 3ed9be4

if (!Number.isFinite(numJudges) || numJudges < 1) {
throw new Error(`numJudges must be >= 1 (received ${String(numJudges)})`);
}
return Math.ceil(numJudges / 2);

@cubic-dev-ai cubic-dev-ai bot Jan 8, 2026

P2: For even numbers of judges, Math.ceil(numJudges / 2) returns the threshold for a tie (50%), not a majority (>50%). With 2 judges, this returns 1 (50%), and with 4 judges, it returns 2 (50%). A true majority threshold should use Math.floor(numJudges / 2) + 1.

Location: packages/@n8n/ai-workflow-builder.ee/evaluations/harness/multi-gen.ts, line 56

<file context>
@@ -0,0 +1,97 @@
+	if (!Number.isFinite(numJudges) || numJudges < 1) {
+		throw new Error(`numJudges must be >= 1 (received ${String(numJudges)})`);
+	}
+	return Math.ceil(numJudges / 2);
+}
+
</file context>
Suggested change
return Math.ceil(numJudges / 2);
return Math.floor(numJudges / 2) + 1;

generateWorkflow: mockGenerateWorkflow,
});

expect(callTimes).toHaveLength(3);

@cubic-dev-ai cubic-dev-ai bot Jan 8, 2026

P2: Test assertion doesn't verify parallelism as the test name claims. expect(callTimes).toHaveLength(3) only confirms all calls were made, not that they ran in parallel. Consider checking that all timestamps fall within a small window (e.g., all within 10ms of the first call) to verify true parallel execution.

Location: packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/evaluators/pairwise.test.ts, line 497

<file context>
@@ -0,0 +1,500 @@
+				generateWorkflow: mockGenerateWorkflow,
+			});
+
+			expect(callTimes).toHaveLength(3);
+		});
+	});
</file context>


const msg = { type: 'ai', content: 'hello' };
const input: KVMap = {
cachedTemplates: [{ templateId: 't1', name: 'Template' }],

@cubic-dev-ai cubic-dev-ai bot Jan 8, 2026

P2: Test doesn't verify filtering behavior as claimed. The cachedTemplates input only contains templateId and name - the exact properties that summarizeCachedTemplates preserves. Consider adding extra properties (e.g., workflow: { nodes: [] }) to the template object to verify they are actually filtered out.

Location: packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/trace-filters.test.ts, line 27

<file context>
@@ -0,0 +1,38 @@
+
+		const msg = { type: 'ai', content: 'hello' };
+		const input: KVMap = {
+			cachedTemplates: [{ templateId: 't1', name: 'Template' }],
+			messages: [msg],
+		};
</file context>
Suggested change
cachedTemplates: [{ templateId: 't1', name: 'Template' }],
cachedTemplates: [{ templateId: 't1', name: 'Template', workflow: { nodes: [] }, fullDefinition: 'large data' }],

✅ Addressed in 3ed9be4

await expect(runV2Evaluation()).rejects.toThrow('process.exit(1)');
});

it('should exit with 0 when pass rate is exactly 70%', async () => {

@cubic-dev-ai cubic-dev-ai bot Jan 8, 2026

P3: This test is redundant with 'should exit with 0 when pass rate >= 70%' above - both use passed: 7, totalExamples: 10 which is exactly 70%. Consider removing this duplicate test or changing the values to truly test the boundary (e.g., passed: 70, totalExamples: 100 to make the "exactly 70%" intent clearer).

Location: packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/cli.test.ts, line 401

<file context>
@@ -0,0 +1,459 @@
+				await expect(runV2Evaluation()).rejects.toThrow('process.exit(1)');
+			});
+
+			it('should exit with 0 when pass rate is exactly 70%', async () => {
+				mockRunEvaluation.mockResolvedValue(createMockSummary({ totalExamples: 10, passed: 7 }));
+
</file context>

- Make artifact output folder names concurrency-safe (index + deterministic short id)
- Enable --output-dir artifact writing in LangSmith mode
- Keep summary results stable by sorting by example index

@burivuhster burivuhster left a comment

Nice refactor! And kudos for the test coverage.
Couple of suggestions/comments

Comment on lines +180 to +192
// Build context - include generateWorkflow for multi-gen pairwise
const isMultiGen = args.suite === 'pairwise' && args.numGenerations > 1;
const llmCallLimiter = pLimit(args.concurrency);

const baseConfig = {
generateWorkflow,
evaluators,
lifecycle,
logger,
outputDir: args.outputDir,
timeoutMs: args.timeoutMs,
context: isMultiGen ? { generateWorkflow, llmCallLimiter } : { llmCallLimiter },
};

Is there a specific reason we need to pass generateWorkflow twice in the config?

mode: 'langsmith',
dataset: args.datasetName ?? getDefaultDatasetName(args.suite),
langsmithClient: (() => {
if (!env.lsClient) {

Would it make sense to move this check to environment init logic?

donts?: string;

numJudges: number;
numGenerations: number;

Can we unify this and use --repetitions parameter everywhere instead?

Comment on lines 39 to 103
export function loadTestCasesFromCsv(csvPath: string): TestCase[] {
const resolvedPath = path.isAbsolute(csvPath) ? csvPath : path.resolve(process.cwd(), csvPath);

if (!existsSync(resolvedPath)) {
throw new Error(`CSV file not found at ${resolvedPath}`);
}

const fileContents = readFileSync(resolvedPath, 'utf8');
const rows = parseCsv(fileContents);

if (rows.length === 0) {
throw new Error('The provided CSV file is empty');
}

let header: ParsedCsvRow | undefined;
let dataRows = rows;

if (isHeaderRow(rows[0])) {
header = rows[0]!;
dataRows = rows.slice(1);
}

if (dataRows.length === 0) {
throw new Error('No prompt rows found in the provided CSV file');
}

const promptIndex = header ? (detectColumnIndex(header, 'prompt') ?? 0) : 0;
const idIndex = header ? detectColumnIndex(header, 'id') : undefined;
const nameIndex = header
? (detectColumnIndex(header, 'name') ?? detectColumnIndex(header, 'title'))
: undefined;
const dosIndex = header
? (detectColumnIndex(header, 'dos') ?? detectColumnIndex(header, 'do'))
: undefined;
const dontsIndex = header
? (detectColumnIndex(header, 'donts') ?? detectColumnIndex(header, 'dont'))
: undefined;

const testCases = dataRows
.map<TestCase | undefined>((row, index) => {
const prompt = sanitizeValue(row[promptIndex]);
if (!prompt) {
return undefined;
}

const idSource = sanitizeValue(idIndex !== undefined ? row[idIndex] : undefined);
const nameSource = sanitizeValue(nameIndex !== undefined ? row[nameIndex] : undefined);
const dosSource = sanitizeValue(dosIndex !== undefined ? row[dosIndex] : undefined);
const dontsSource = sanitizeValue(dontsIndex !== undefined ? row[dontsIndex] : undefined);

return {
id: idSource || `csv-case-${index + 1}`,
name: nameSource || generateNameFromPrompt(prompt, index),
...(idSource ? { id: idSource } : { id: `csv-case-${index + 1}` }),
prompt,
...((dosSource || dontsSource) && {
context: {
...(dosSource ? { dos: dosSource } : {}),
...(dontsSource ? { donts: dontsSource } : {}),
},
}),
};
})
.filter((testCase): testCase is TestCase => testCase !== undefined);

if (testCases.length === 0) {
throw new Error('No valid prompts found in the provided CSV file');
}

return testCases;
}

This function is a bit hard to read because of the many undefineds, the ternary and spread operators, and the different logic depending on whether we have a header.
Here's Claude's take on how it can be simplified:

export function loadTestCasesFromCsv(csvPath: string): TestCase[] {
  const resolvedPath = path.isAbsolute(csvPath) ? csvPath : path.resolve(process.cwd(), csvPath);

  if (!existsSync(resolvedPath)) {
    throw new Error(`CSV file not found at ${resolvedPath}`);
  }

  const rows = parseCsv(readFileSync(resolvedPath, 'utf8'));

  if (rows.length === 0) {
    throw new Error('The provided CSV file is empty');
  }

  const header = isHeaderRow(rows[0]) ? rows[0] : undefined;
  const dataRows = header ? rows.slice(1) : rows;

  if (dataRows.length === 0) {
    throw new Error('No prompt rows found in the provided CSV file');
  }

  const col = (name: string, ...aliases: string[]) => {
    if (!header) return undefined;
    for (const n of [name, ...aliases]) {
      const idx = detectColumnIndex(header, n);
      if (idx !== undefined) return idx;
    }
    return undefined;
  };

  const promptIdx = col('prompt') ?? 0;
  const idIdx = col('id');
  const dosIdx = col('dos', 'do');
  const dontsIdx = col('donts', 'dont');

  const testCases: TestCase[] = [];

  for (let i = 0; i < dataRows.length; i++) {
    const row = dataRows[i];
    const prompt = sanitizeValue(row[promptIdx]);
    if (!prompt) continue;

    const testCase: TestCase = {
      id: sanitizeValue(row[idIdx!]) || `csv-case-${i + 1}`,
      prompt,
    };

    const dos = sanitizeValue(row[dosIdx!]);
    const donts = sanitizeValue(row[dontsIdx!]);

    if (dos || donts) {
      testCase.context = {};
      if (dos) testCase.context.dos = dos;
      if (donts) testCase.context.donts = donts;
    }

    testCases.push(testCase);
  }

  if (testCases.length === 0) {
    throw new Error('No valid prompts found in the provided CSV file');
  }

  return testCases;
}


Should this whole judge/evaluators directory be moved under evaluators/llm-judge?

Comment on lines 22 to 25
/** Optional reference workflow for similarity-based checks */
referenceWorkflow?: SimpleWorkflow;
/** Optional reference workflows for similarity-based checks (best match wins) */
referenceWorkflows?: SimpleWorkflow[];

This could also be a single referenceWorkflows array; consumers could pass an array with a single item instead of referenceWorkflow.

* Optional generator for multi-generation evaluations.
* When present, pairwise evaluator can generate multiple workflows from the same prompt.
*/
generateWorkflow?: (prompt: string) => Promise<SimpleWorkflow>;

Not sure how much work it is to move workflow generation higher up the call chain, but it would be nice to make the evaluator agnostic of multi-generation and just always process a single generation. It would require some higher-level code to do the metrics aggregation, though.

hash = Math.imul(hash, 0x01000193);
}
return hash >>> 0;
}

🤯

Comment on lines 799 to 813
);
const evalDurationMs = Date.now() - evalStart;

const totalDurationMs = Date.now() - startTime;
const score = calculateExampleScore(feedback);
const status = hasErrorFeedback(feedback)
? 'error'
: determineStatus({ score, passThreshold });
stats.total++;
stats.scoreSum += score;
stats.durationSumMs += totalDurationMs;
if (status === 'pass') stats.passed++;
else if (status === 'fail') stats.failed++;
else stats.errors++;
const result: ExampleResult = {

Suggested change
);
const evalDurationMs = Date.now() - evalStart;
const totalDurationMs = Date.now() - startTime;
const score = calculateExampleScore(feedback);
const status = hasErrorFeedback(feedback)
? 'error'
: determineStatus({ score, passThreshold });
stats.total++;
stats.scoreSum += score;
stats.durationSumMs += totalDurationMs;
if (status === 'pass') stats.passed++;
else if (status === 'fail') stats.failed++;
else stats.errors++;
const result: ExampleResult = {
);
const evalDurationMs = Date.now() - evalStart;
const totalDurationMs = Date.now() - startTime;

const score = calculateExampleScore(feedback);
const status = hasErrorFeedback(feedback)
? 'error'
: determineStatus({ score, passThreshold });

stats.total++;
stats.scoreSum += score;
stats.durationSumMs += totalDurationMs;
if (status === 'pass') stats.passed++;
else if (status === 'fail') stats.failed++;
else stats.errors++;

const result: ExampleResult = {

(Just line breaks to improve readability)

Comment on lines 43 to 45
│ runEvaluation(config) │
│ │
│ Config contains: │

ASCII-art schemes have rough edges 😅


@mike12345567 mike12345567 left a comment

Really great cleanup, this really tidies up an area that was starting to get a little out of control!

Left a few comments I think we should consider addressing. One other thing as well: I think it would be nice to maintain some of the old functionality we had, or at least have a package.json command to achieve it. Being able to run the matrix suite with our old prompts could be useful, because they've been a benchmark so far and we have an idea of how they score. I think it would also be nice to have a command for the old default output directory and markdown report generation - the JSON is really useful, but it's also handy to have a quick report we can look through.

Really like splitting the generated workflow JSON and feedback JSON - that's a massive pain reduction; I always found myself trying to scrape the correct workflow out of the massive results JSON!

fb('maintainability.modularity', result.maintainability.modularity, 'detail'),

// Overall score
fb('overallScore', result.overallScore, 'score', result.summary),

Should this also emit the score for the bestPractices?

* Weights should sum to approximately 1.0.
*/
export const DEFAULT_EVALUATOR_WEIGHTS: ScoreWeights = {
'llm-judge': 0.4,

Do the similarity evals need a weighting as well?

@@ -0,0 +1,331 @@
import { mock } from 'jest-mock-extended';

Is this test suite useful? I don't think we usually do type testing - I'm wondering if it's just an over-reach by Claude!


#### 3. Workflow Evaluator (`chains/workflow-evaluator.ts`)
# Local: LLM-judge + programmatic
pnpm eval --prompt "Create a workflow that..." --verbose

Do we not have an option now to run with our standard set of eval prompts as we did before?


type FlagDef = { key: CliKey; kind: CliValueKind };

const FLAG_TO_KEY: Record<string, FlagDef> = {

Would it be possible to add a --help command to list some of this information? Tried it locally, think it could be helpful.

Signed-off-by: Oleg Ivaniv <[email protected]>

Use yargs with schema validation and help/alias handling, drop bespoke parsing test, and tidy imports.

Signed-off-by: Oleg Ivaniv <[email protected]>
- remove yargs dependency and restore the manual argument parser
- add focused parser tests for numeric flags, filters, and edge cases
- clean up runner debug logging

Signed-off-by: Oleg Ivaniv <[email protected]>
Move the lsClient check to happen right after environment setup
rather than inline during config building. This fails fast when
LANGSMITH_API_KEY is missing in langsmith mode.
- Add findColumn() helper to handle column aliases and header checks
- Add getCell() helper to eliminate awkward ternary patterns
- Replace map/filter chain with explicit for-loop
- Build context conditionally with if-statements instead of nested spreads
Move all LLM judge-related files from evaluations/judge/ into
evaluations/evaluators/llm-judge/ for better organization.

- Move judge/evaluators/* to evaluators/llm-judge/evaluators/
- Move judge/evaluation.ts and workflow-evaluator.ts
- Update all import paths

Signed-off-by: Oleg Ivaniv <[email protected]>
Unify the API to only use referenceWorkflows (array) instead of having
both referenceWorkflow (single) and referenceWorkflows (array). This
simplifies the interface and reduces confusion.

Added backwards compatibility in runner.ts to handle legacy datasets
that still use the single referenceWorkflow field.
Use Node's built-in crypto module for generating deterministic short IDs
instead of manual bit manipulation. Much more readable while providing
the same functionality.
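
For example, something along these lines would do (a sketch; the exact hash, length, and inputs in the harness may differ):

import { createHash } from 'node:crypto';

// Deterministic short id for concurrency-safe artifact folder names (sketch).
function shortId(input: string, length = 8): string {
  return createHash('sha256').update(input).digest('hex').slice(0, length);
}

// e.g. `${String(index).padStart(3, '0')}-${shortId(prompt)}`
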
Separate logical groups in processExample with blank lines:
- duration calculations
- score and status
- stats updates
- result creation
Mermaid diagrams render cleanly on GitHub and avoid alignment issues
with ASCII art.
Add explicit weight for similarity evaluator (0.15) and rebalance:
- llm-judge: 0.35 (was 0.4)
- programmatic: 0.25 (was 0.3)
- pairwise: 0.25 (was 0.3)
- similarity: 0.15 (new)
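
As a sketch, the resulting defaults and the cross-evaluator combination step look roughly like this (how evaluators that are absent from a run are handled is an assumption, not necessarily what score-calculator.ts does):

// Rebalanced defaults described above (see DEFAULT_EVALUATOR_WEIGHTS / ScoreWeights).
const DEFAULT_EVALUATOR_WEIGHTS: Record<string, number> = {
  'llm-judge': 0.35,
  programmatic: 0.25,
  pairwise: 0.25,
  similarity: 0.15,
}; // sums to 1.0

// Cross-evaluator score: weight each evaluator's average score; if an evaluator
// did not run, skip it and renormalize over the remaining weights (assumption).
function combineScores(perEvaluatorAverages: Record<string, number>): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const [evaluator, score] of Object.entries(perEvaluatorAverages)) {
    const weight = DEFAULT_EVALUATOR_WEIGHTS[evaluator] ?? 0;
    weighted += weight * score;
    totalWeight += weight;
  }
  return totalWeight > 0 ? weighted / totalWeight : 0;
}
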
Move default test cases from hardcoded TypeScript array to a CSV fixture
file at fixtures/default-prompts.csv. This makes it easier to modify test
cases without code changes and provides a cleaner separation of data.

- Add fixtures/default-prompts.csv with 10 default test prompts
- Add loadDefaultTestCases() and getDefaultTestCaseIds() helpers
- Update CLI to use CSV fixture instead of basicTestCases
- Update README to document default prompts usage
- Remove basicTestCases export from test-case-generator.ts
Auto-generate help text from flag definitions with descriptions and groups.
Flags are organized by category: Input, Evaluation, Pairwise, LangSmith,
Output, Feature Flags, and Advanced.

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 41 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/README.md">

<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/README.md:81">
P3: The diagram is missing `'detail'` from the `kind` type. The actual `Feedback` interface (and the Feedback section later in this file) defines `kind: 'score' | 'metric' | 'detail'`.</violation>
</file>


F1["evaluator: string"]
F2["metric: string"]
F3["score: 0-1"]
F4["kind: 'score' | 'metric'"]

@cubic-dev-ai cubic-dev-ai bot Jan 8, 2026

P3: The diagram is missing 'detail' from the kind type. The actual Feedback interface (and the Feedback section later in this file) defines kind: 'score' | 'metric' | 'detail'.

Location: packages/@n8n/ai-workflow-builder.ee/evaluations/README.md, line 81

<file context>
@@ -38,47 +44,43 @@ popd
+        F1["evaluator: string"]
+        F2["metric: string"]
+        F3["score: 0-1"]
+        F4["kind: 'score' | 'metric'"]
+        F5["comment?: string"]
+    end
</file context>
Suggested change
F4["kind: 'score' | 'metric'"]
F4["kind: 'score' | 'metric' | 'detail'"]


Labels

n8n team Authored by the n8n team
