
feat(ai-builder): Improve eval verifier and mock handler reliability (no-changelog) #28255

Open
JoseBra wants to merge 2 commits into master from trust-34-improve-llm-response-parsing-in-eval-verifier-and-mock


@JoseBra JoseBra commented Apr 9, 2026

Summary

  • Use structured output on the verifier to eliminate "No verification result" failures — response is now schema-validated instead of parsed from free-form text
  • Add retry logic to the mock handler with configurable maxRetries (default 1) to recover from transient LLM failures
  • Improve failure categorization — verifier now traces data flow before attributing failures, correctly distinguishing builder logic errors from mock data issues
  • Remove hardcoded maxTokens overrides so the AI SDK uses model defaults
  • Make test case scenarios less brittle (remove exact count expectations, use minimal response hints)
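
For reviewers, the retry behaviour described in the second bullet can be pictured roughly as follows. This is a minimal TypeScript sketch, not the code from the diff: `generateWithRetries`, `MockResult`, and the `attempts` field are hypothetical names, and it assumes `maxRetries` counts extra attempts after the first call, so the default of 1 allows two attempts in total. Only `maxRetries` and `_evalMockError` are names that appear in this PR.

```typescript
// Hypothetical sketch: names other than maxRetries and _evalMockError are
// illustrative and not taken from the PR diff.
type MockResult<T> =
  | { ok: true; value: T; attempts: number }
  | { ok: false; _evalMockError: string; attempts: number };

async function generateWithRetries<T>(
  generate: () => Promise<T>,
  maxRetries = 1, // assumption: number of retries after the initial attempt
): Promise<MockResult<T>> {
  const totalAttempts = maxRetries + 1;
  let lastError = 'unknown error';

  for (let attempt = 1; attempt <= totalAttempts; attempt++) {
    try {
      const value = await generate();
      return { ok: true, value, attempts: attempt };
    } catch (error) {
      // Treat every failure as transient: log it and try again if attempts remain.
      lastError = error instanceof Error ? error.message : String(error);
      console.warn(`mock generation attempt ${attempt}/${totalAttempts} failed: ${lastError}`);
    }
  }

  // All attempts failed: return an error object instead of throwing.
  return { ok: false, _evalMockError: lastError, attempts: totalAttempts };
}
```

Returning an error object rather than throwing keeps a single failed mock from aborting the whole scenario run; whether the real handler makes the same choice should be checked against the diff.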

Related Linear ticket

https://linear.app/n8n/issue/TRUST-34

Test plan

  • Mock handler tests: 32/32 pass
  • Typecheck clean (instance-ai)
  • Full suite run: zero "No verification result", 1 mock_issue out of 27 scenarios

🤖 Generated with Claude Code

…(no-changelog)

- Use structured output on the verifier to eliminate "No verification
  result" failures — the LLM response is now schema-validated instead
  of parsed from free-form text
- Add retry logic to the mock handler with configurable maxRetries
  (default 1) to recover from transient LLM failures
- Improve failure categorization in the verifier prompt — trace data
  flow before attributing failures, distinguishing builder logic errors
  from mock data issues
- Remove hardcoded maxTokens overrides, let the AI SDK use model defaults
- Make test case scenarios less brittle: remove exact count expectations
  from rest-api-data-pipeline, use minimal Notion response hints

Ref: https://linear.app/n8n/issue/TRUST-34

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
...s/cli/src/modules/instance-ai/eval/mock-handler.ts | 92.30% | 1 Missing ⚠️


@n8n-assistant n8n-assistant bot added the core (Enhancement outside /nodes-base and /editor-ui) and n8n team (Authored by the n8n team) labels Apr 9, 2026
@JoseBra JoseBra marked this pull request as ready for review April 9, 2026 18:54
@JoseBra JoseBra requested review from a team and schrothbn and removed request for a team April 9, 2026 18:55

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 5 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/@n8n/instance-ai/evaluations/data/workflows/rest-api-data-pipeline.json">

<violation number="1" location="packages/@n8n/instance-ai/evaluations/data/workflows/rest-api-data-pipeline.json:11">
P2: The happy-path success criteria dropped verification of the required post count, so this test can pass even if the workflow misses part of the requested Slack summary.</violation>
</file>

<file name="packages/@n8n/instance-ai/evaluations/checklist/verifier.ts">

<violation number="1" location="packages/@n8n/instance-ai/evaluations/checklist/verifier.ts:12">
P1: Structured output schema does not match the verifier prompt contract (array + nullable fields), so valid model responses can be rejected and still fall back to missing verification results.</violation>
</file>
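
To make the P1 concern concrete, here is a hedged, dependency-free TypeScript sketch of a validator that tolerates both points raised above: a top-level array (or an object wrapping one) and nullable fields. The diagram in this PR mentions a Zod `checklistResultSchema`; this sketch deliberately swaps in a hand-written type guard so it runs without dependencies. The field names `id`, `pass`, `reasoning`, `failureCategory`, and the `results` wrapper key are assumptions, not the PR's actual contract; only `builder_issue` and `mock_issue` come from this page.

```typescript
// Hypothetical result shape; field names are assumptions for illustration.
type FailureCategory = 'builder_issue' | 'mock_issue';

interface ChecklistResult {
  id: number;
  pass: boolean;
  reasoning: string;
  failureCategory: FailureCategory | null;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

function isFailureCategory(value: unknown): value is FailureCategory {
  return value === 'builder_issue' || value === 'mock_issue';
}

// Returns the parsed results, or null when the payload does not match.
function parseChecklistResults(raw: unknown): ChecklistResult[] | null {
  // Accept a bare array or an object that wraps the array under `results`.
  const items: unknown = Array.isArray(raw) ? raw : isRecord(raw) ? raw.results : undefined;
  if (!Array.isArray(items)) return null;

  const parsed: ChecklistResult[] = [];
  for (const item of items) {
    if (!isRecord(item)) return null;
    const { id, pass, reasoning } = item;
    if (typeof id !== 'number' || typeof pass !== 'boolean' || typeof reasoning !== 'string') {
      return null;
    }

    // A missing or null category is valid (for example, on passing items).
    const rawCategory = item.failureCategory;
    let failureCategory: FailureCategory | null;
    if (rawCategory === null || rawCategory === undefined) {
      failureCategory = null;
    } else if (isFailureCategory(rawCategory)) {
      failureCategory = rawCategory;
    } else {
      return null;
    }

    parsed.push({ id, pass, reasoning, failureCategory });
  }
  return parsed;
}
```

If the prompt asks the model for `null` on passing items but the schema marks the field as a required enum, a response like `{ id: 2, pass: true, reasoning: "...", failureCategory: null }` would be rejected even though it follows the prompt; the sketch accepts it.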
Architecture diagram
sequenceDiagram
    participant Runner as Eval Runner
    participant Mock as Mock Handler
    participant LLM as AI Provider (SDK)
    participant Verifier as Eval Verifier

    Note over Runner,Verifier: Phase 1: Workflow Execution (Mocking)

    Runner->>Mock: interceptRequest(node, context)
    
    loop NEW: Retry Logic (maxRetries)
        Mock->>LLM: generate(userPrompt)
        Note right of LLM: CHANGED: Using model default maxTokens
        alt Success
            LLM-->>Mock: JSON specification
            Mock->>Mock: materializeSpec()
        else LLM Error / Timeout
            LLM-->>Mock: Transient Failure
            Note over Mock: Log warning & decrement retries
        end
    end

    alt All Retries Failed
        Mock-->>Runner: Return _evalMockError object
    else Success
        Mock-->>Runner: Return mocked response body/status
    end

    Note over Runner,Verifier: Phase 2: Result Verification

    Runner->>Verifier: verify(artifact, checklist)
    
    Verifier->>LLM: NEW: structuredOutput(schema)
    Note right of LLM: Uses checklistResultSchema (Zod)
    
    LLM-->>Verifier: Validated JSON Object
    
    alt Schema Validation Success
        Verifier->>Verifier: CHANGED: Trace data flow
        Note right of Verifier: Categorize: builder_issue vs mock_issue
        Verifier-->>Runner: List of ChecklistResult
    else NEW: Schema Validation Failure (null)
        Verifier->>Verifier: Log warning (No results)
        Verifier-->>Runner: Empty results array
    end
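
The last branch of the diagram (schema validation failure yields a warning and an empty results array) could be sketched as below. This is an illustrative TypeScript fragment under stated assumptions: `resultsOrEmpty` and its parameters are hypothetical names, and the structured-output call is abstracted into a callback that resolves to `null` when validation fails.

```typescript
// Hypothetical helper: models only the fallback branch shown in the diagram.
async function resultsOrEmpty<T>(
  requestStructured: () => Promise<T[] | null>, // null means validation failed
  warn: (message: string) => void = console.warn,
): Promise<T[]> {
  const results = await requestStructured();
  if (results === null) {
    // Fall back to an empty list so the runner can continue with other scenarios.
    warn('No verification result: structured output failed schema validation');
    return [];
  }
  return results;
}
```

A caller would then see an empty list for that scenario rather than an exception, which matches the "Empty results array" arrow above.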


…changelog)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>