
feat: eval pipeline #5369

Merged
rguan72 merged 69 commits into main from richard/eval-pipeline
Sep 17, 2025

Conversation


@rguan72 rguan72 commented Sep 9, 2025

Description

Eval Pipeline

  • Supports arbitrary LLM / persona configuration (tools, etc.)
  • Configurable deep research / thoughtful mode and other input arguments
  • Runs locally or remotely (with a cloud superuser API key)

How Has This Been Tested?

Tested locally and against test.onyx.app.
Unit tests will be added once we can dependency-inject some LLM dependencies (for stream_message_objects).
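The dependency-injection idea mentioned above can be sketched roughly as follows. `stream_message_objects` is the real function name from the PR, but the `make_llm` factory parameter and the `FakeLLM` class here are hypothetical illustrations of how injection would enable unit testing:

```python
from typing import Callable, Iterator


class FakeLLM:
    """Hypothetical stand-in LLM that yields canned tokens for unit tests."""

    def stream(self, prompt: str) -> Iterator[str]:
        yield from ["Hello", ", ", "world"]


def stream_message_objects(
    question: str,
    make_llm: Callable[[], FakeLLM] = FakeLLM,  # injected factory (hypothetical signature)
) -> Iterator[str]:
    """Sketch: the real function builds prompts and streams answer packets;
    injecting the LLM factory lets tests run without network calls."""
    llm = make_llm()
    yield from llm.stream(question)


# In a unit test, inject the fake and collect the stream:
answer = "".join(stream_message_objects("What is Onyx?"))
print(answer)  # Hello, world
```

With a real LLM factory passed in production and `FakeLLM` in tests, the streaming logic itself stays identical in both paths.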

Backporting (check the box to trigger backport action)

Note: check that the backport action passes; otherwise, resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Summary by cubic

Add a minimal evaluation pipeline using Braintrust to test our chat answer generation end-to-end with OpenAI GPT-4.1. Includes a tutorial script that builds prompts with AnswerPromptBuilder and returns concatenated streamed output.

  • New Features

    • Added backend/evals/eval_tutorial.py with a Braintrust Eval that calls get_answer.
    • Uses get_llm to initialize GPT-4.1 (environment loaded via dotenv), with fast_llm sharing the same model.
    • Falls back to a mocked SQLAlchemy Session if tenant DB session is unavailable.
    • Builds prompts via AnswerPromptBuilder and aggregates streamed Answer content.
    • Ships with a small sample dataset and empty tools list for easy extension.
  • Dependencies

    • Added braintrust==0.2.6 and autoevals==0.0.130 to backend/requirements/default.txt.
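The overall shape of the tutorial script (an eval over a small dataset, a task that calls get_answer, and a string-similarity scorer) can be approximated without the braintrust dependency. Everything below except that structure is a hypothetical stand-in: the canned `get_answer` body and the `similarity` scorer are illustrations, not the PR's code.

```python
from difflib import SequenceMatcher


def get_answer(question: str) -> str:
    """Hypothetical stand-in: the real task builds a prompt with
    AnswerPromptBuilder and concatenates the streamed Answer output."""
    return "Onyx is an open-source AI platform."


def similarity(output: str, expected: str) -> float:
    """Rough analogue of an autoevals string scorer (ratio in [0, 1])."""
    return SequenceMatcher(None, output, expected).ratio()


dataset = [
    {"input": "What is Onyx?", "expected": "Onyx is an open-source AI platform."},
]

# Braintrust's Eval(...) runs this loop and records results remotely;
# run locally, it reduces to scoring each example.
scores = [similarity(get_answer(ex["input"]), ex["expected"]) for ex in dataset]
print(scores)  # [1.0] for the canned answer above
```

Swapping the canned task for the real get_answer and the scorer for an autoevals scorer recovers the Braintrust version; the dataset and scoring loop stay the same shape.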

@rguan72 rguan72 requested a review from a team as a code owner September 9, 2025 02:11

vercel bot commented Sep 9, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: internal-search · Deployment: Ready · Preview: Ready · Updated (UTC): Sep 17, 2025 6:42pm


@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR introduces an evaluation pipeline for the Onyx system by adding two new dependencies and creating an evaluation tutorial. The changes add braintrust==0.2.6 and autoevals==0.0.130 to the default requirements, which are libraries for ML model evaluation and automated scoring. A new tutorial file backend/evals/eval_tutorial.py demonstrates how to use Braintrust to evaluate the Answer generation functionality.

The evaluation pipeline integrates with Onyx's existing Answer class infrastructure, setting up LLM configurations, prompt builders, and database sessions to generate responses for test questions. The implementation includes fallback mechanisms to work in environments where full infrastructure might not be available, using mock database sessions when real connections aren't accessible. The tutorial creates evaluation experiments that can systematically test response quality using OpenAI's GPT-4.1 model, with the results tracked through the Braintrust platform.
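The fallback described here can be sketched with the standard library's unittest.mock; `get_tenant_session` is a hypothetical name for the real session helper, and the raise simulates an environment with no database:

```python
from unittest.mock import MagicMock


def get_tenant_session():
    """Hypothetical: in the real code this opens a tenant-scoped SQLAlchemy Session."""
    raise RuntimeError("no database available in this environment")


try:
    db_session = get_tenant_session()
except Exception:
    # Fall back to a mock Session so the eval can still run
    # where full infrastructure is unavailable.
    db_session = MagicMock(name="mock_sqlalchemy_session")

print(type(db_session).__name__)  # MagicMock
```

A MagicMock absorbs any attribute access or method call, so code paths that only incidentally touch the session keep working; anything that genuinely needs query results would still require a real connection.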

This addition represents a move toward more systematic evaluation and monitoring of the AI system's performance, potentially supporting A/B testing and ongoing quality assessment in production environments. The evaluation framework appears designed to be extensible for testing different aspects of the Onyx system's AI capabilities.

Confidence score: 2/5

  • This PR has several technical issues that could cause runtime failures and needs significant cleanup before merging
  • Score reflects problematic code patterns including hardcoded model names, improper error handling, and unused variables that violate coding standards
  • Pay close attention to backend/evals/eval_tutorial.py which contains multiple issues that need to be addressed

Context used:

Rule - When hardcoding a boolean variable to a constant value, remove the variable entirely and clean up all places where it's used rather than just setting it to a constant.
Context - Use explicit type annotations for variables to enhance code clarity, especially when moving type hints around in the code.

2 files reviewed, no comments



@cubic-dev-ai cubic-dev-ai bot left a comment


5 issues found across 2 files


@rguan72 rguan72 merged commit c558732 into main Sep 17, 2025
52 of 54 checks passed
@rguan72 rguan72 deleted the richard/eval-pipeline branch September 17, 2025 19:17
razvanMiu pushed a commit to eea/danswer that referenced this pull request Oct 16, 2025
