
feat: eval pipeline #5369

Merged
rguan72 merged 69 commits into main from richard/eval-pipeline
Sep 17, 2025

Conversation


@rguan72 rguan72 commented Sep 9, 2025

Description

Eval Pipeline

  • Supports arbitrary LLM / persona configuration (tools, etc.)
  • Configurable deep research / thoughtful mode and other input arguments
  • Runs locally or remotely (with a cloud superuser API key)

How Has This Been Tested?

Tested locally and against test.onyx.app.
Unit tests will be added once we can dependency-inject some LLM dependencies (for stream_message_objects).
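The dependency-injection idea mentioned above can be sketched roughly as follows. `stream_message_objects` is the real function name from the PR, but the `make_llm` factory parameter and the `FakeLLM` class here are hypothetical illustrations of how injection would enable unit testing:

```python
from typing import Callable, Iterator


class FakeLLM:
    """Hypothetical stand-in LLM that yields canned tokens for unit tests."""

    def stream(self, prompt: str) -> Iterator[str]:
        yield from ["Hello", ", ", "world"]


def stream_message_objects(
    question: str,
    make_llm: Callable[[], FakeLLM] = FakeLLM,  # injected factory (hypothetical signature)
) -> Iterator[str]:
    """Sketch: the real function builds prompts and streams answer packets;
    injecting the LLM factory lets tests run without network calls."""
    llm = make_llm()
    yield from llm.stream(question)


# In a unit test, inject the fake and collect the stream:
answer = "".join(stream_message_objects("What is Onyx?"))
print(answer)  # Hello, world
```

With a real LLM factory passed in production and `FakeLLM` in tests, the streaming logic itself stays identical in both paths.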

Backporting (check the box to trigger backport action)

Note: check that the backport action passes; otherwise, resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Summary by cubic

Add a minimal evaluation pipeline using Braintrust to test our chat answer generation end-to-end with OpenAI GPT-4.1. Includes a tutorial script that builds prompts with AnswerPromptBuilder and returns concatenated streamed output.

  • New Features

    • Added backend/evals/eval_tutorial.py with a Braintrust Eval that calls get_answer.
    • Uses get_llm to initialize GPT-4.1 (environment loaded via dotenv), with fast_llm sharing the same model.
    • Falls back to a mocked SQLAlchemy Session if tenant DB session is unavailable.
    • Builds prompts via AnswerPromptBuilder and aggregates streamed Answer content.
    • Ships with a small sample dataset and empty tools list for easy extension.
  • Dependencies

    • Added braintrust==0.2.6 and autoevals==0.0.130 to backend/requirements/default.txt.
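The overall shape of the tutorial script (an eval over a small dataset, a task that calls get_answer, and a string-similarity scorer) can be approximated without the braintrust dependency. Everything below except that structure is a hypothetical stand-in: the canned `get_answer` body and the `similarity` scorer are illustrations, not the PR's code.

```python
from difflib import SequenceMatcher


def get_answer(question: str) -> str:
    """Hypothetical stand-in: the real task builds a prompt with
    AnswerPromptBuilder and concatenates the streamed Answer output."""
    return "Onyx is an open-source AI platform."


def similarity(output: str, expected: str) -> float:
    """Rough analogue of an autoevals string scorer (ratio in [0, 1])."""
    return SequenceMatcher(None, output, expected).ratio()


dataset = [
    {"input": "What is Onyx?", "expected": "Onyx is an open-source AI platform."},
]

# Braintrust's Eval(...) runs this loop and records results remotely;
# run locally, it reduces to scoring each example.
scores = [similarity(get_answer(ex["input"]), ex["expected"]) for ex in dataset]
print(scores)  # [1.0] for the canned answer above
```

Swapping the canned task for the real get_answer and the scorer for an autoevals scorer recovers the Braintrust version; the dataset and scoring loop stay the same shape.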

@rguan72 rguan72 requested a review from a team as a code owner September 9, 2025 02:11

vercel bot commented Sep 9, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: internal-search · Deployment: Ready · Preview: Ready · Updated (UTC): Sep 17, 2025 6:42pm


@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR introduces an evaluation pipeline for the Onyx system by adding two new dependencies and creating an evaluation tutorial. The changes add braintrust==0.2.6 and autoevals==0.0.130 to the default requirements, which are libraries for ML model evaluation and automated scoring. A new tutorial file backend/evals/eval_tutorial.py demonstrates how to use Braintrust to evaluate the Answer generation functionality.

The evaluation pipeline integrates with Onyx's existing Answer class infrastructure, setting up LLM configurations, prompt builders, and database sessions to generate responses for test questions. The implementation includes fallback mechanisms to work in environments where full infrastructure might not be available, using mock database sessions when real connections aren't accessible. The tutorial creates evaluation experiments that can systematically test response quality using OpenAI's GPT-4.1 model, with the results tracked through the Braintrust platform.
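The fallback described here can be sketched with the standard library's unittest.mock; `get_tenant_session` is a hypothetical name for the real session helper, and the raise simulates an environment with no database:

```python
from unittest.mock import MagicMock


def get_tenant_session():
    """Hypothetical: in the real code this opens a tenant-scoped SQLAlchemy Session."""
    raise RuntimeError("no database available in this environment")


try:
    db_session = get_tenant_session()
except Exception:
    # Fall back to a mock Session so the eval can still run
    # where full infrastructure is unavailable.
    db_session = MagicMock(name="mock_sqlalchemy_session")

print(type(db_session).__name__)  # MagicMock
```

A MagicMock absorbs any attribute access or method call, so code paths that only incidentally touch the session keep working; anything that genuinely needs query results would still require a real connection.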

This addition represents a move toward more systematic evaluation and monitoring of the AI system's performance, potentially supporting A/B testing and ongoing quality assessment in production environments. The evaluation framework appears designed to be extensible for testing different aspects of the Onyx system's AI capabilities.

Confidence score: 2/5

  • This PR has several technical issues that could cause runtime failures and needs significant cleanup before merging
  • Score reflects problematic code patterns including hardcoded model names, improper error handling, and unused variables that violate coding standards
  • Pay close attention to backend/evals/eval_tutorial.py which contains multiple issues that need to be addressed

Context used:

Rule - When hardcoding a boolean variable to a constant value, remove the variable entirely and clean up all places where it's used rather than just setting it to a constant.
Context - Use explicit type annotations for variables to enhance code clarity, especially when moving type hints around in the code.

2 files reviewed, no comments



@cubic-dev-ai cubic-dev-ai bot left a comment


5 issues found across 2 files


@rguan72 rguan72 merged commit c558732 into main Sep 17, 2025
52 of 54 checks passed
@rguan72 rguan72 deleted the richard/eval-pipeline branch September 17, 2025 19:17
razvanMiu pushed a commit to eea/danswer that referenced this pull request Oct 16, 2025
