refactor(ai-builder): Implement unified evaluations harness #23955
base: master
Conversation
Add trace filtering to avoid 403 errors from oversized LangSmith payloads during concurrent evaluations. Large fields like cachedTemplates and parsedNodeTypes are summarized while preserving essential debugging info.

Changes:
- Add trace-filters.ts with hideInputs/hideOutputs filtering logic
- Add resetFilteringStats() to ensure accurate per-run statistics
- Add TRACE_BATCH_SIZE_LIMIT and TRACE_BATCH_CONCURRENCY constants
- Pass custom LangSmith client to evaluate() calls

Filtering is enabled by default. Set LANGSMITH_MINIMAL_TRACING=false to disable and get full traces.
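A minimal sketch of the hook this enables, assuming the langsmith JS Client's hideInputs/hideOutputs options; the summarization shown is illustrative, the real logic lives in trace-filters.ts:

```ts
import { Client } from 'langsmith';

// Illustrative only: summarize the heavy fields, keep everything else as-is.
function summarizeHeavyFields(payload: Record<string, unknown>): Record<string, unknown> {
	const { cachedTemplates, parsedNodeTypes, ...rest } = payload;
	return {
		...rest,
		...(cachedTemplates !== undefined ? { cachedTemplates: '[summarized]' } : {}),
		...(parsedNodeTypes !== undefined ? { parsedNodeTypes: '[summarized]' } : {}),
	};
}

const lsClient = new Client({
	hideInputs: summarizeHeavyFields,
	hideOutputs: summarizeHeavyFields,
});
```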
Replace global state with closure-scoped state per client instance. This avoids issues with parallel evaluations corrupting shared counters.
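A sketch of the closure-scoped pattern described above (names are illustrative; the real factory is createTraceFilters):

```ts
// Each client gets its own counters; nothing is shared at module scope.
export function createTraceFilters() {
	let inputsFiltered = 0;

	return {
		hideInputs(inputs: Record<string, unknown>): Record<string, unknown> {
			inputsFiltered += 1;
			return inputs; // actual filtering elided in this sketch
		},
		resetFilteringStats(): void {
			inputsFiltered = 0;
		},
		getFilteringStats: () => ({ inputsFiltered }),
	};
}
```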
Pass EvalLogger through setupTestEnvironment to createTraceFilters for consistent logging across the evaluation system.
Extract helper functions to reduce cyclomatic complexity: - summarizeContextField() for context field placeholder strings - filterWorkflowContext() for workflowContext object filtering - summarizeLargeWorkflow() for conditional workflow summarization - trackInputPassthrough() for stat tracking on unchanged inputs
- Add verbose parameter to runLangsmithEvaluation() - Use EvalLogger for consistent logging output - Pass logger to setupTestEnvironment() for trace filter logging - Document CLI options for eval:langsmith in README
Add detailed verbose output showing: - Judge details: individual verdicts with brief justifications - Timing breakdown: generation time vs judge time with averages - Workflow summary: compact node type listing (e.g. "5 nodes (Webhook, IF, HTTP Request x2)") Works for both local pairwise (--prompt) and LangSmith pairwise modes.
Add --verbose flag to local CLI evaluation (pnpm eval) showing: - Per-test results as they complete (PASS/WARN/ERROR with score) - Generation timing - Workflow summary (node types) - Key category scores (functionality, connections, config) - Critical issues if any In verbose mode, progress bar is replaced with real-time test output.
- Add per-example result logging with prompt, scores, and pass/fail status - Add summary statistics (pass rate, average scores) at end of evaluation - Add dataset stats and model info in verbose mode - Enhance trace filtering: add messages array summarization - Reduce batch size limit to 2MB and add batchSizeLimit option - Add input field truncation for large LangChain model inputs
Upgrade to get: - batchSizeLimit option for controlling runs per batch - Better multipart upload handling (may fix 403 errors) - Memory leak fixes and async improvements - omitTracedRuntimeInfo option for smaller payloads
- Add unified argument parser for all CLI flags
- Add ordered progress reporter for real-time verbose logging
- Add abstract runner base class with template method pattern
- Create LLM-judge runner that moves evaluation INTO target function (fixes "Run not created by target function" error in LangSmith 0.4.x)
- Create pairwise runner using new architecture
- Update index.ts to use new unified runners

Key fix: LangSmith 0.4.x requires all LLM calls to happen inside the traceable target function. The old evaluator was calling LLM chains outside the traceable context, causing 403 errors.
- Add core interfaces (Evaluator, Feedback, RunConfig, Lifecycle)
- Add runner supporting both local and LangSmith modes
- Add console lifecycle for progress reporting
- Add comprehensive test coverage (55 tests passing)

Key design decisions:
- Factory pattern for evaluator creation
- Pre-computed feedback pattern for LangSmith compatibility
- Parallel evaluators, sequential examples
- Skip and continue on errors
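An illustrative sketch of the factory pattern mentioned above; the actual interfaces live in the harness types, and the names below are assumptions:

```ts
type Feedback = {
	evaluator: string;
	metric: string;
	score: number; // 0..1
	kind: 'score' | 'metric' | 'detail';
	comment?: string;
};

type Evaluator = (workflow: unknown, context: Record<string, unknown>) => Promise<Feedback[]>;

// Factory: capture configuration once, return a backend-agnostic evaluator.
function createExampleEvaluator(options: { id: string }): Evaluator {
	return async (_workflow, _context) => [
		{ evaluator: options.id, metric: 'overallScore', score: 1, kind: 'score' },
	];
}
```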
- LLM-judge evaluator wraps existing evaluateWorkflow chain - Programmatic evaluator wraps rule-based checks - Add index files for module exports - All 60 tests passing Factory pattern enables: - Easy composition of evaluators - Parallel execution - Centralized error handling via runner
- Create runV2Evaluation() function that ties together: - Environment setup - Workflow generator - Evaluator factories - Console lifecycle - Local and LangSmith modes - Demonstrates full v2 harness usage - 60 tests passing Signed-off-by: Oleg Ivaniv <[email protected]>
- Add pairwise evaluator factory wrapping runJudgePanel() - Add 9 tests for pairwise evaluator (TDD) - Fix prompt not being passed to LLM-judge evaluator context - Fix LangSmith dataset format (support messages[] array) - Improve verbose output with critical metrics and violations - Remove truncation from violation output - Add README.md with mental model and documentation
- Add CLI tests (19 tests) covering loadTestCases, mode selection, config building, exit codes, and workflow generator setup - Add programmatic evaluator tests (9 tests) covering all feedback categories and violation formatting - Expand lifecycle tests (28 additional tests) for verbose output, critical metrics display, violations, and merge functions Coverage improvement: - cli.ts: 0% → 76% - programmatic.ts: 0% → 100% - lifecycle.ts: 66% → 100% - Overall v2/: 63% → 89%
…e filtering to v2 Artifact saving: - Add createArtifactSaver() for persisting evaluation results to disk - Save prompt.txt, workflow.json, feedback.json per example - Save summary.json with per-evaluator statistics - 13 tests for output module Similarity evaluator: - Add createSimilarityEvaluator() wrapping Python graph edit distance - Support single and multiple reference workflows - Support preset configurations (strict/standard/lenient) - 11 tests for similarity evaluator Trace filtering: - Integrate trace filtering into LangSmith mode (enabled by default) - Add enableTraceFiltering option to LangsmithOptions - Re-export createTraceFilters and isMinimalTracingEnabled Coverage: 90% statements, 80% branches, 140 tests passing
Add utilities for analyzing LLM token cache performance: - calculateCacheStats() - compute stats from token usage metadata - aggregateCacheStats() - aggregate multiple stats with correct hit rate - formatCacheStats() - format for display with locale strings 14 tests for cache analyzer module.
Add ability to generate multiple workflows per prompt and aggregate
results for variance reduction in pairwise evaluation.
New features:
- createPairwiseEvaluator({ numGenerations: N }) - generate N workflows
- aggregateGenerations() - calculate generation correctness
- getMajorityThreshold() - majority voting utility
- CLI support via --generations flag
Feedback keys for multi-gen:
- pairwise.generationCorrectness (passing/total)
- pairwise.aggregatedDiagnostic (avg score)
- pairwise.genN.majorityPass/diagnosticScore (per-gen details)
19 new tests for multi-gen functionality.
…se generator to v2 - Add score-calculator.ts with weighted scoring and evaluator grouping - Add report-generator.ts for markdown report generation - Add test-case-generator.ts with LLM-based generation and basicTestCases - Fix --max-examples flag for LangSmith mode in runner-base.ts - Export all new utilities from v2/index.ts 62 new tests added (235 total for v2)
…aceable()

The LangSmith SDK's evaluate() function checks if the target is already wrapped with traceable(). When it is, the SDK skips applying critical defaultOptions (on_end, reference_example_id, client), causing issues like:
- Target function executing multiple times per example
- Missing traces in LangSmith dashboard
- Client mismatch between evaluate() and inner traceable() calls

The fix:
- Do NOT wrap target function with traceable() - let evaluate() handle it
- DO wrap inner operations (like generateWorkflow) with traceable() for child trace visibility

Both v2 runner and pairwise generator now follow this harmonized pattern.
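A minimal sketch of the harmonized pattern using the langsmith JS SDK's evaluate() and traceable(); the dataset name and generator body are placeholders:

```ts
import { Client } from 'langsmith';
import { evaluate } from 'langsmith/evaluation';
import { traceable } from 'langsmith/traceable';

const lsClient = new Client();

// Inner operation IS wrapped, so it shows up as a child trace on the same client.
const generateWorkflow = traceable(
	async (prompt: string) => {
		// ...call the workflow builder agent here...
		return { nodes: [], connections: {} };
	},
	{ name: 'generateWorkflow', client: lsClient },
);

// The target passed to evaluate() is NOT wrapped with traceable();
// evaluate() wraps it itself and attaches reference_example_id, on_end, and the client.
await evaluate(
	async (inputs: Record<string, unknown>) => await generateWorkflow(String(inputs.prompt)),
	{
		data: 'ai-builder-evals', // placeholder dataset name
		client: lsClient,
	},
);
```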
- Add trace flushing with awaitPendingTraceBatches() to ensure all traces are sent before process exits
- Fix exit code in v2 CLI: LangSmith mode now returns 0 on success since results are in the dashboard, not the placeholder summary
- Use limit parameter on listExamples() instead of fetching all and slicing to avoid double target execution
- Add --dataset CLI flag for specifying LangSmith dataset name
- Add maxExamples option to LangsmithOptions type
- Fix message type detection in trace-filters to avoid errors with unusual message objects
Add comprehensive documentation about LangSmith SDK interactions: - Root cause of traceable() + evaluate() conflicts - Correct pattern: don't wrap target, do wrap inner operations - Environment variables required for tracing - Trace flushing importance - Payload size filtering tips - numRepetitions behavior - AsyncLocalStorage context tracking - Client consistency requirements - Debugging tips and SDK source location Also updates CLI usage examples and file structure documentation.
Signed-off-by: Oleg Ivaniv <[email protected]>
… mode - Add logger to RunConfig and pass it from CLI to runner - Replace console.log calls in runner.ts with proper logger methods: - Per-workflow progress → logger.verbose() (only shown with --verbose) - Important status messages → logger.info() (always shown) - Remove [v2] prefix from log messages (no longer needed post-migration) - Remove redundant console.warn calls from programmatic-evaluation.ts (error info is already captured in the returned violation result) This enables cleaner output by default while preserving detailed logging via the --verbose flag.
- runner.ts: Handle unknown error type in template literal with proper type narrowing (instanceof Error check)
- trace-filters.ts: Replace unsafe type casts with proper type guards (hasGetTypeMethod, hasTypeProperty, getTypeName) for safe property access
- test-case-generator.ts: Add Zod-based validation with parseTestCasesOutput() to safely type LLM structured output
- test-case-generator.test.ts: Create helper functions with type guards to safely extract data from jest mock calls
- output.test.ts: Use jsonParse<T> from n8n-workflow with proper type interfaces for type-safe JSON parsing in tests

All fixes use runtime type guards instead of type casting to satisfy strict TypeScript lint rules.
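A sketch of the Zod-based validation approach described for parseTestCasesOutput(); the schema fields are assumptions based on the test-case shape used elsewhere in this PR:

```ts
import { z } from 'zod';

const testCaseSchema = z.object({
	id: z.string(),
	prompt: z.string(),
});

const testCasesOutputSchema = z.array(testCaseSchema);

function parseTestCasesOutput(raw: unknown): Array<z.infer<typeof testCaseSchema>> {
	const parsed = testCasesOutputSchema.safeParse(raw);
	if (!parsed.success) {
		throw new Error(`LLM returned malformed test cases: ${parsed.error.message}`);
	}
	return parsed.data;
}
```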
- Require langsmithClient in LangSmith RunConfig and pass it from CLI setupTestEnvironment - Remove runner-side setupTestEnvironment call to avoid re-initialization/config drift - Ensure nested traceable() uses the same client instance via TraceableConfig.client - Update unit tests for new RunConfig contract
- Prevent NaN in verbose logs, artifacts, and reports by averaging finite scores only
- Clarify weighting layers (LLM category vs cross-evaluator) with explicit names
- Tighten LangSmith reference workflow parsing and ignore invalid refs
- Apply timeouts inside p-limit work; document best-effort cancellation
- Add focused tests for NaN handling, ref parsing, and limiter-slot timeouts
- Remove cache/usage reporting from eval CLI and LangSmith evaluator
- Delete unused cache analyzer module and tests
- Keep full message arrays in LangSmith trace filtering (no summarization)
- Cover header order, do/dont aliases, BOM, headerless CSV, and empty-row handling
- Add negative-path assertions for missing/empty files and no valid prompts

Signed-off-by: Oleg Ivaniv <[email protected]>
Move eval harness into clear submodules (cli/harness/judge/langsmith/support) and update scripts/docs/tests accordingly. - Keep python evals under evaluations/programmatic/python unchanged - Remove categorization-eval artifacts (types + prompts example) - Fix load-nodes path resolution and update downstream imports
Refresh evaluations README for the v2 harness. - Add quick start, prerequisites, CSV format, and component map - Clarify Feedback.kind semantics and LangSmith metric key mapping - Remove SDK-internals/debug-only LangSmith gotchas (keep only the critical traceable rule)
13 issues found across 109 files
Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts:57">
P2: Rule violated: **Prefer Typeguards over Type casting**
Use a type guard instead of `as` for type narrowing. The pattern `typeof x === 'object' && x !== null` can be encapsulated in a reusable type guard like `isRecord(value): value is Record<string, unknown>`.</violation>
<violation number="2" location="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts:75">
P2: Rule violated: **Prefer Typeguards over Type casting**
Use a type guard instead of `as` for type narrowing. This is the same pattern as `summarizeWorkflow` - consider creating a shared `isRecord()` type guard.</violation>
<violation number="3" location="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts:141">
P2: Rule violated: **Prefer Typeguards over Type casting**
Use a type guard instead of `as` for type narrowing. Consider creating a type guard function that validates the structure (e.g., `hasNodes(value): value is { nodes?: unknown[] }`).</violation>
<violation number="4" location="packages/@n8n/ai-workflow-builder.ee/evaluations/langsmith/trace-filters.ts:202">
P2: Rule violated: **Prefer Typeguards over Type casting**
Use a type guard instead of `as` for type narrowing. The `typeof === 'object'` check validates runtime type but a type guard would properly narrow the TypeScript type.</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/lifecycle.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/lifecycle.ts:306">
P2: Inverted verbose condition: `onEvaluatorError` only logs in non-verbose mode, which is the opposite of other hooks (`onExampleStart`, `onExampleComplete`). Evaluator errors will be silently suppressed in verbose mode.</violation>
<violation number="2" location="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/lifecycle.ts:392">
P2: Rule violated: **Prefer Typeguards over Type casting**
Use a type predicate instead of `as` for type narrowing after filtering. The `filter(Boolean)` pattern doesn't automatically narrow types in TypeScript - use a type guard function instead.</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/evaluators/programmatic/index.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/evaluators/programmatic/index.ts:72">
P2: Missing feedback entries for `nodes` and `credentials` evaluation results. The underlying `programmaticEvaluation` computes and returns both `result.nodes` and `result.credentials` with `.score` and `.violations` properties (same structure as other categories), but these are not included in the feedback array. This means these metric scores won't be visible on dashboards for debugging/analysis, even though they contribute to the overall score.</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/multi-gen.test.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/multi-gen.test.ts:51">
P2: Test description doesn't match assertion: the test says "should return 2 for 2 judges" but the assertion expects 1, not 2.</violation>
<violation number="2" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/multi-gen.test.ts:55">
P2: Test description doesn't match assertion: the test says "should return 3 for 4 judges" but the assertion expects 2, not 3.</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/multi-gen.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/harness/multi-gen.ts:56">
P2: For even numbers of judges, `Math.ceil(numJudges / 2)` returns the threshold for a tie (50%), not a majority (>50%). With 2 judges, this returns 1 (50%), and with 4 judges, it returns 2 (50%). A true majority threshold should use `Math.floor(numJudges / 2) + 1`.</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/cli.test.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/cli.test.ts:401">
P3: This test is redundant with 'should exit with 0 when pass rate >= 70%' above - both use `passed: 7, totalExamples: 10` which is exactly 70%. Consider removing this duplicate test or changing the values to truly test the boundary (e.g., `passed: 70, totalExamples: 100` to make the "exactly 70%" intent clearer).</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/evaluators/pairwise.test.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/evaluators/pairwise.test.ts:497">
P2: Test assertion doesn't verify parallelism as the test name claims. `expect(callTimes).toHaveLength(3)` only confirms all calls were made, not that they ran in parallel. Consider checking that all timestamps fall within a small window (e.g., all within 10ms of the first call) to verify true parallel execution.</violation>
</file>
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/trace-filters.test.ts">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/__tests__/trace-filters.test.ts:27">
P2: Test doesn't verify filtering behavior as claimed. The `cachedTemplates` input only contains `templateId` and `name` - the exact properties that `summarizeCachedTemplates` preserves. Consider adding extra properties (e.g., `workflow: { nodes: [] }`) to the template object to verify they are actually filtered out.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
if (!workflow || typeof workflow !== 'object') {
	return workflow;
}
const wf = workflow as { nodes?: unknown[] };
P2: Rule violated: Prefer Typeguards over Type casting
Use a type guard instead of as for type narrowing. Consider creating a type guard function that validates the structure (e.g., hasNodes(value): value is { nodes?: unknown[] }).
✅ Addressed in 3ed9be4
function summarizeCachedTemplates(templates: unknown[]): Array<Record<string, unknown>> {
	return templates.map((t) => {
		if (!t || typeof t !== 'object') return { unknown: true };
		const template = t as Record<string, unknown>;
P2: Rule violated: Prefer Typeguards over Type casting
Use a type guard instead of as for type narrowing. This is the same pattern as summarizeWorkflow - consider creating a shared isRecord() type guard.
✅ Addressed in 3ed9be4
	return { unknown: true };
}

const wf = workflow as Record<string, unknown>;
P2: Rule violated: Prefer Typeguards over Type casting
Use a type guard instead of as for type narrowing. The pattern typeof x === 'object' && x !== null can be encapsulated in a reusable type guard like isRecord(value): value is Record<string, unknown>.
✅ Addressed in 3ed9be4
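One possible shape for the shared guards suggested here (illustrative; the actual fix in the PR may differ):

```ts
function isRecord(value: unknown): value is Record<string, unknown> {
	return typeof value === 'object' && value !== null;
}

function hasNodes(value: unknown): value is { nodes?: unknown[] } {
	return isRecord(value) && (value.nodes === undefined || Array.isArray(value.nodes));
}
```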
// Handle workflowContext if present
if (filtered.workflowContext && typeof filtered.workflowContext === 'object') {
	filtered.workflowContext = filterWorkflowContext(
		filtered.workflowContext as Record<string, unknown>,
P2: Rule violated: Prefer Typeguards over Type casting
Use a type guard instead of as for type narrowing. The typeof === 'object' check validates runtime type but a type guard would properly narrow the TypeScript type.
✅ Addressed in 3ed9be4
export function mergeLifecycles(
	...lifecycles: Array<Partial<EvaluationLifecycle> | undefined>
): EvaluationLifecycle {
	const validLifecycles = lifecycles.filter(Boolean) as Array<Partial<EvaluationLifecycle>>;
P2: Rule violated: Prefer Typeguards over Type casting
Use a type predicate instead of as for type narrowing after filtering. The filter(Boolean) pattern doesn't automatically narrow types in TypeScript - use a type guard function instead.
Suggested change:
	const validLifecycles = lifecycles.filter(
		(lc): lc is Partial<EvaluationLifecycle> => Boolean(lc),
	);
✅ Addressed in 3ed9be4
	expect(getMajorityThreshold(5)).toBe(3);
});

it('should return 2 for 2 judges', () => {
P2: Test description doesn't match assertion: the test says "should return 2 for 2 judges" but the assertion expects 1, not 2.
Suggested change:
	it('should return 1 for 2 judges', () => {
✅ Addressed in 3ed9be4
if (!Number.isFinite(numJudges) || numJudges < 1) {
	throw new Error(`numJudges must be >= 1 (received ${String(numJudges)})`);
}
return Math.ceil(numJudges / 2);
P2: For even numbers of judges, Math.ceil(numJudges / 2) returns the threshold for a tie (50%), not a majority (>50%). With 2 judges, this returns 1 (50%), and with 4 judges, it returns 2 (50%). A true majority threshold should use Math.floor(numJudges / 2) + 1.
Suggested change:
	return Math.floor(numJudges / 2) + 1;
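A quick worked example of the difference for an even judge count:

```ts
const numJudges = 4;
Math.ceil(numJudges / 2);      // 2 → a 2-2 split already meets the threshold (tie, not a majority)
Math.floor(numJudges / 2) + 1; // 3 → requires a strict majority (>50%)
```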
	generateWorkflow: mockGenerateWorkflow,
});

expect(callTimes).toHaveLength(3);
P2: Test assertion doesn't verify parallelism as the test name claims. expect(callTimes).toHaveLength(3) only confirms all calls were made, not that they ran in parallel. Consider checking that all timestamps fall within a small window (e.g., all within 10ms of the first call) to verify true parallel execution.
const msg = { type: 'ai', content: 'hello' };
const input: KVMap = {
	cachedTemplates: [{ templateId: 't1', name: 'Template' }],
P2: Test doesn't verify filtering behavior as claimed. The cachedTemplates input only contains templateId and name - the exact properties that summarizeCachedTemplates preserves. Consider adding extra properties (e.g., workflow: { nodes: [] }) to the template object to verify they are actually filtered out.
Suggested change:
	cachedTemplates: [{ templateId: 't1', name: 'Template', workflow: { nodes: [] }, fullDefinition: 'large data' }],
✅ Addressed in 3ed9be4
	await expect(runV2Evaluation()).rejects.toThrow('process.exit(1)');
});

it('should exit with 0 when pass rate is exactly 70%', async () => {
P3: This test is redundant with 'should exit with 0 when pass rate >= 70%' above - both use passed: 7, totalExamples: 10 which is exactly 70%. Consider removing this duplicate test or changing the values to truly test the boundary (e.g., passed: 70, totalExamples: 100 to make the "exactly 70%" intent clearer).
- Make artifact output folder names concurrency-safe (index + deterministic short id) - Enable --output-dir artifact writing in LangSmith mode - Keep summary results stable by sorting by example index
burivuhster
left a comment
Nice refactor! And kudos for the test coverage.
Couple of suggestions/comments
// Build context - include generateWorkflow for multi-gen pairwise
const isMultiGen = args.suite === 'pairwise' && args.numGenerations > 1;
const llmCallLimiter = pLimit(args.concurrency);

const baseConfig = {
	generateWorkflow,
	evaluators,
	lifecycle,
	logger,
	outputDir: args.outputDir,
	timeoutMs: args.timeoutMs,
	context: isMultiGen ? { generateWorkflow, llmCallLimiter } : { llmCallLimiter },
};
Is there a specific reason we need to pass generateWorkflow twice in the config?
mode: 'langsmith',
dataset: args.datasetName ?? getDefaultDatasetName(args.suite),
langsmithClient: (() => {
	if (!env.lsClient) {
Would it make sense to move this check to environment init logic?
donts?: string;

numJudges: number;
numGenerations: number;
Can we unify this and use --repetitions parameter everywhere instead?
export function loadTestCasesFromCsv(csvPath: string): TestCase[] {
	const resolvedPath = path.isAbsolute(csvPath) ? csvPath : path.resolve(process.cwd(), csvPath);

	if (!existsSync(resolvedPath)) {
		throw new Error(`CSV file not found at ${resolvedPath}`);
	}

	const fileContents = readFileSync(resolvedPath, 'utf8');
	const rows = parseCsv(fileContents);

	if (rows.length === 0) {
		throw new Error('The provided CSV file is empty');
	}

	let header: ParsedCsvRow | undefined;
	let dataRows = rows;

	if (isHeaderRow(rows[0])) {
		header = rows[0]!;
		dataRows = rows.slice(1);
	}

	if (dataRows.length === 0) {
		throw new Error('No prompt rows found in the provided CSV file');
	}

	const promptIndex = header ? (detectColumnIndex(header, 'prompt') ?? 0) : 0;
	const idIndex = header ? detectColumnIndex(header, 'id') : undefined;
	const nameIndex = header
		? (detectColumnIndex(header, 'name') ?? detectColumnIndex(header, 'title'))
		: undefined;
	const dosIndex = header
		? (detectColumnIndex(header, 'dos') ?? detectColumnIndex(header, 'do'))
		: undefined;
	const dontsIndex = header
		? (detectColumnIndex(header, 'donts') ?? detectColumnIndex(header, 'dont'))
		: undefined;

	const testCases = dataRows
		.map<TestCase | undefined>((row, index) => {
			const prompt = sanitizeValue(row[promptIndex]);
			if (!prompt) {
				return undefined;
			}

			const idSource = sanitizeValue(idIndex !== undefined ? row[idIndex] : undefined);
			const nameSource = sanitizeValue(nameIndex !== undefined ? row[nameIndex] : undefined);
			const dosSource = sanitizeValue(dosIndex !== undefined ? row[dosIndex] : undefined);
			const dontsSource = sanitizeValue(dontsIndex !== undefined ? row[dontsIndex] : undefined);

			return {
				id: idSource || `csv-case-${index + 1}`,
				name: nameSource || generateNameFromPrompt(prompt, index),
				...(idSource ? { id: idSource } : { id: `csv-case-${index + 1}` }),
				prompt,
				...((dosSource || dontsSource) && {
					context: {
						...(dosSource ? { dos: dosSource } : {}),
						...(dontsSource ? { donts: dontsSource } : {}),
					},
				}),
			};
		})
		.filter((testCase): testCase is TestCase => testCase !== undefined);

	if (testCases.length === 0) {
		throw new Error('No valid prompts found in the provided CSV file');
	}

	return testCases;
}
This function is a bit hard to read because of the many undefineds, ternaries, spread operators, and the different logic depending on whether we have a header.
Here's Claude's take on how it can be simplified:
export function loadTestCasesFromCsv(csvPath: string): TestCase[] {
const resolvedPath = path.isAbsolute(csvPath) ? csvPath : path.resolve(process.cwd(), csvPath);
if (!existsSync(resolvedPath)) {
throw new Error(`CSV file not found at ${resolvedPath}`);
}
const rows = parseCsv(readFileSync(resolvedPath, 'utf8'));
if (rows.length === 0) {
throw new Error('The provided CSV file is empty');
}
const header = isHeaderRow(rows[0]) ? rows[0] : undefined;
const dataRows = header ? rows.slice(1) : rows;
if (dataRows.length === 0) {
throw new Error('No prompt rows found in the provided CSV file');
}
const col = (name: string, ...aliases: string[]) => {
if (!header) return undefined;
for (const n of [name, ...aliases]) {
const idx = detectColumnIndex(header, n);
if (idx !== undefined) return idx;
}
return undefined;
};
const promptIdx = col('prompt') ?? 0;
const idIdx = col('id');
const dosIdx = col('dos', 'do');
const dontsIdx = col('donts', 'dont');
const testCases: TestCase[] = [];
for (let i = 0; i < dataRows.length; i++) {
const row = dataRows[i];
const prompt = sanitizeValue(row[promptIdx]);
if (!prompt) continue;
const testCase: TestCase = {
id: sanitizeValue(row[idIdx!]) || `csv-case-${i + 1}`,
prompt,
};
const dos = sanitizeValue(row[dosIdx!]);
const donts = sanitizeValue(row[dontsIdx!]);
if (dos || donts) {
testCase.context = {};
if (dos) testCase.context.dos = dos;
if (donts) testCase.context.donts = donts;
}
testCases.push(testCase);
}
if (testCases.length === 0) {
throw new Error('No valid prompts found in the provided CSV file');
}
return testCases;
}
Should this whole judge/evaluators directory be moved under evaluators/llm-judge?
/** Optional reference workflow for similarity-based checks */
referenceWorkflow?: SimpleWorkflow;
/** Optional reference workflows for similarity-based checks (best match wins) */
referenceWorkflows?: SimpleWorkflow[];
Could also be a single array referenceWorkflows, consumers may pass array of a single item instead of referenceWorkflow
 * Optional generator for multi-generation evaluations.
 * When present, pairwise evaluator can generate multiple workflows from the same prompt.
 */
generateWorkflow?: (prompt: string) => Promise<SimpleWorkflow>;
Not sure how much work is to move workflow generation upper in the call chain, but it would be nice to make evaluator agnostic of the multi-generation and just always process a single generation. It would require some higher-level code doing the metrics aggregation though.
	hash = Math.imul(hash, 0x01000193);
}
return hash >>> 0;
}
🤯
);
const evalDurationMs = Date.now() - evalStart;

const totalDurationMs = Date.now() - startTime;
const score = calculateExampleScore(feedback);
const status = hasErrorFeedback(feedback)
	? 'error'
	: determineStatus({ score, passThreshold });
stats.total++;
stats.scoreSum += score;
stats.durationSumMs += totalDurationMs;
if (status === 'pass') stats.passed++;
else if (status === 'fail') stats.failed++;
else stats.errors++;
const result: ExampleResult = {
Suggested change (line breaks only, separating the logical groups):

);
const evalDurationMs = Date.now() - evalStart;
const totalDurationMs = Date.now() - startTime;

const score = calculateExampleScore(feedback);
const status = hasErrorFeedback(feedback)
	? 'error'
	: determineStatus({ score, passThreshold });

stats.total++;
stats.scoreSum += score;
stats.durationSumMs += totalDurationMs;
if (status === 'pass') stats.passed++;
else if (status === 'fail') stats.failed++;
else stats.errors++;

const result: ExampleResult = {
(Just line breaks to improve readability)
│ runEvaluation(config) │
│ │
│ Config contains: │
ASCII-art schemes have rough edges 😅
mike12345567
left a comment
Really great cleanup, really tidies this up which was starting to get a little out of control!
Left a few comments I think we should consider addressing. One other thing as well: I think it would be nice to maintain some of the old functionality we had, or at least have a package.json command to achieve it. Being able to run the matrix suite with our old prompts could be useful, because they've been a benchmark so far and we have an idea of how they score. I think it would also be nice to have a command for the old default output directory and markdown report generation - the JSON is really useful, but it's also handy to have a quick report we can look through.
Really like the generation splitting up the workflow JSON and feedback JSON, that's a massive pain reduction, always found myself trying to scrape the correct workflow out of the massive results JSON!
	fb('maintainability.modularity', result.maintainability.modularity, 'detail'),

	// Overall score
	fb('overallScore', result.overallScore, 'score', result.summary),
Should this also emit the score for the bestPractices?
 * Weights should sum to approximately 1.0.
 */
export const DEFAULT_EVALUATOR_WEIGHTS: ScoreWeights = {
	'llm-judge': 0.4,
Does the similarity evals need a weighting as well?
@@ -0,0 +1,331 @@
import { mock } from 'jest-mock-extended';
Is this test suite useful? I don't think we usually do type testing; I'm wondering if it's just an over-reach by Claude!
#### 3. Workflow Evaluator (`chains/workflow-evaluator.ts`)
# Local: LLM-judge + programmatic
pnpm eval --prompt "Create a workflow that..." --verbose
Do we not have an option now to run with our standard set of eval prompts as we did before?
type FlagDef = { key: CliKey; kind: CliValueKind };

const FLAG_TO_KEY: Record<string, FlagDef> = {
Would it be possible to add a --help command to list some of this information? Tried it locally, think it could be helpful.
Signed-off-by: Oleg Ivaniv <[email protected]>
Use yargs with schema validation and help/alias handling, drop bespoke parsing test, and tidy imports. Signed-off-by: Oleg Ivaniv <[email protected]>
- remove yargs dependency and restore the manual argument parser - add focused parser tests for numeric flags, filters, and edge cases - clean up runner debug logging Signed-off-by: Oleg Ivaniv <[email protected]>
Move the lsClient check to happen right after environment setup rather than inline during config building. This fails fast when LANGSMITH_API_KEY is missing in langsmith mode.
- Add findColumn() helper to handle column aliases and header checks - Add getCell() helper to eliminate awkward ternary patterns - Replace map/filter chain with explicit for-loop - Build context conditionally with if-statements instead of nested spreads
Move all LLM judge-related files from evaluations/judge/ into evaluations/evaluators/llm-judge/ for better organization. - Move judge/evaluators/* to evaluators/llm-judge/evaluators/ - Move judge/evaluation.ts and workflow-evaluator.ts - Update all import paths Signed-off-by: Oleg Ivaniv <[email protected]>
Unify the API to only use referenceWorkflows (array) instead of having both referenceWorkflow (single) and referenceWorkflows (array). This simplifies the interface and reduces confusion. Added backwards compatibility in runner.ts to handle legacy datasets that still use the single referenceWorkflow field.
Use Node's built-in crypto module for generating deterministic short IDs instead of manual bit manipulation. Much more readable while providing the same functionality.
Separate logical groups in processExample with blank lines: - duration calculations - score and status - stats updates - result creation
Mermaid diagrams render cleanly on GitHub and avoid alignment issues with ASCII art.
Add explicit weight for similarity evaluator (0.15) and rebalance: - llm-judge: 0.35 (was 0.4) - programmatic: 0.25 (was 0.3) - pairwise: 0.25 (was 0.3) - similarity: 0.15 (new)
Move default test cases from hardcoded TypeScript array to a CSV fixture file at fixtures/default-prompts.csv. This makes it easier to modify test cases without code changes and provides a cleaner separation of data. - Add fixtures/default-prompts.csv with 10 default test prompts - Add loadDefaultTestCases() and getDefaultTestCaseIds() helpers - Update CLI to use CSV fixture instead of basicTestCases - Update README to document default prompts usage - Remove basicTestCases export from test-case-generator.ts
Auto-generate help text from flag definitions with descriptions and groups. Flags are organized by category: Input, Evaluation, Pairwise, LangSmith, Output, Feature Flags, and Advanced.
1 issue found across 41 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="packages/@n8n/ai-workflow-builder.ee/evaluations/README.md">
<violation number="1" location="packages/@n8n/ai-workflow-builder.ee/evaluations/README.md:81">
P3: The diagram is missing `'detail'` from the `kind` type. The actual `Feedback` interface (and the Feedback section later in this file) defines `kind: 'score' | 'metric' | 'detail'`.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| F1["evaluator: string"] | ||
| F2["metric: string"] | ||
| F3["score: 0-1"] | ||
| F4["kind: 'score' | 'metric'"] |
P3: The diagram is missing 'detail' from the kind type. The actual Feedback interface (and the Feedback section later in this file) defines kind: 'score' | 'metric' | 'detail'.
| F4["kind: 'score' | 'metric'"] | |
| F4["kind: 'score' | 'metric' | 'detail'"] |
Summary
This PR rewrites the AI Workflow Builder evaluations into a single v2 harness with:
- A unified runner (local + langsmith backends) that owns dataset setup, concurrency, timeouts, scoring, and artifact writing.
- Evaluators as composable plugins that emit a shared Feedback[] format.

Why
The previous evals were hard to extend safely because responsibilities were duplicated across multiple runners:
What changed (high level)
1) Unified runner / harness
The core entrypoint is evaluations/harness/runner.ts (runEvaluation(config)), which now owns:
- Concurrency via a shared llmCallLimiter (p-limit), used for generation and evaluator LLM calls.
- Per-example timeouts via withTimeout() (best-effort; does not cancel underlying requests).
- Scoring based on Feedback.kind.
- Artifact writing (--output-dir) in local mode.

Evaluators are now "plugins" that don't care about backends; they just evaluate a workflow + context and return feedback.
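A usage sketch of the entrypoint; the config fields mirror the CLI snippet quoted earlier in the review thread, the values are illustrative, and generateWorkflow/evaluators/lifecycle/logger are assumed to be built beforehand:

```ts
import pLimit from 'p-limit';

const summary = await runEvaluation({
	mode: 'local',
	generateWorkflow,        // (prompt) => Promise<SimpleWorkflow>
	evaluators,              // created via the evaluator factories
	lifecycle,               // console lifecycle for progress reporting
	logger,
	outputDir: './eval-results',
	timeoutMs: 120_000,
	context: { llmCallLimiter: pLimit(4) },
});
```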
2) Shared types + feedback contract
All evaluators emit Feedback[] (evaluations/harness/harness-types.ts):
- evaluator: stable evaluator id (e.g. llm-judge, pairwise, programmatic)
- metric: metric key (supports dot-path sub-metrics)
- score: normalized 0..1
- kind: score | metric | detail (used by the scorer; detail should not affect overall scoring)
- comment: optional explanation / violations

This contract is what keeps the harness readable: the runner doesn't need evaluator-specific coercions or ad-hoc post-processing.
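Spelled out as an interface (the authoritative definition lives in harness-types.ts):

```ts
interface Feedback {
	/** Stable evaluator id, e.g. 'llm-judge', 'pairwise', 'programmatic' */
	evaluator: string;
	/** Metric key; dot-paths allowed for sub-metrics */
	metric: string;
	/** Normalized 0..1 */
	score: number;
	/** 'detail' entries are informational and excluded from overall scoring */
	kind: 'score' | 'metric' | 'detail';
	/** Optional explanation / violations */
	comment?: string;
}
```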
3) LangSmith integration fixes (reliable traces + smaller payloads)
Key decisions in the LangSmith runner:
- Do not wrap the evaluate() target with traceable(); the LangSmith SDK wraps it and attaches critical options (linking to examples, run lifecycle, client/project settings).
- Do wrap inner operations with traceable() for child traces and explicitly pass client: lsClient so child traces attach to the same run tree.

To reduce payload size (and avoid 403 multipart errors on large runs), we keep "minimal tracing" enabled by default and filter heavy state fields via hideInputs/hideOutputs (evaluations/langsmith/trace-filters.ts). We explicitly keep messages untrimmed to avoid breaking downstream expectations.
LangSmith metric key mapping is centralized (
evaluations/harness/feedback.ts):overallScore,maintainability.workflowOrganization) to match historical dashboards.programmatic.trigger) so they don’t collide with LLM-judge root keys.pairwise_primary,pairwise_total_violations, etc.) as raw counts/ratios as before; additional judge/gen details are namespaced (e.g.pairwise.judge1).5) Directory structure (separation by responsibility)
The evals folder is reorganized so ownership is obvious:
evaluations/cli/: CLI entry + arg parsing + CSV loaderevaluations/harness/: runner, lifecycle logging, scoring, artifacts, helper utilitiesevaluations/evaluators/: evaluator factories (llm-judge, pairwise, programmatic, similarity)evaluations/judge/: LLM-judge internals (schema + category evaluators + workflow evaluator)evaluations/langsmith/: LangSmith helpers (types + trace filters)evaluations/support/: environment setup (LLM + node types + LangSmith client), node loading, reports, test-case generationevaluations/programmatic/python/: unchanged layout (Python similarity tooling)Legacy/categorization evaluation artifacts were removed as part of cleanup.
Behavior changes / notes
- Concurrency is controlled via --concurrency (a global limiter); nested LLM calls are routed through the same limiter to prevent multiplicative parallelism.
- Artifacts are written under --output-dir (one folder per example + summary.json).
- LangSmith mode does not use the local prompt flags (--prompt, --prompts-csv, --test-case) and requires --dataset.

Tests
This PR adds/expands Jest coverage for:
- CSV test-case loading (header aliases, dos/donts)
- concurrency and timeout handling via p-limit

Run from packages/@n8n/ai-workflow-builder.ee:

How to verify (manual)
From packages/@n8n/ai-workflow-builder.ee:

Related Linear tickets, Github issues, and Community forum posts
Review / Merge checklist
release/backport (if the PR is an urgent fix that needs to be backported)