Conversation
Greptile Summary
This PR introduces an evaluation pipeline for the Onyx system by adding two new dependencies and creating an evaluation tutorial. The changes add braintrust==0.2.6 and autoevals==0.0.130 to the default requirements, which are libraries for ML model evaluation and automated scoring. A new tutorial file backend/evals/eval_tutorial.py demonstrates how to use Braintrust to evaluate the Answer generation functionality.
The evaluation pipeline integrates with Onyx's existing Answer class infrastructure, setting up LLM configurations, prompt builders, and database sessions to generate responses for test questions. The implementation includes fallback mechanisms to work in environments where full infrastructure might not be available, using mock database sessions when real connections aren't accessible. The tutorial creates evaluation experiments that can systematically test response quality using OpenAI's GPT-4.1 model, with the results tracked through the Braintrust platform.
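The mock-session fallback described above can be sketched as a small context manager. This is an illustrative pattern only: the import path `onyx.db.engine.get_session_with_current_tenant` is an assumption about Onyx's session helper, not something confirmed by this PR.

```python
from contextlib import contextmanager
from typing import Iterator
from unittest.mock import MagicMock


@contextmanager
def eval_db_session() -> Iterator[object]:
    """Yield a real Onyx DB session when the infrastructure is reachable,
    otherwise fall back to a MagicMock so the eval can still run."""
    try:
        # Assumed import path for Onyx's session helper.
        from onyx.db.engine import get_session_with_current_tenant
        ctx = get_session_with_current_tenant()
    except Exception:
        # No database available (e.g. a local eval run): use a mock session.
        yield MagicMock()
        return
    with ctx as session:
        yield session
```

With this shape, the tutorial code can use `with eval_db_session() as db:` unconditionally and behave the same whether or not a real database is configured.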
This addition represents a move toward more systematic evaluation and monitoring of the AI system's performance, potentially supporting A/B testing and ongoing quality assessment in production environments. The evaluation framework appears designed to be extensible for testing different aspects of the Onyx system's AI capabilities.
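A minimal sketch of the Braintrust experiment pattern the summary describes, under stated assumptions: the project name and test data are invented here, and `generate_answer` stubs the Answer-class call rather than reproducing `backend/evals/eval_tutorial.py`.

```python
def generate_answer(question: str) -> str:
    """Stand-in for driving Onyx's Answer class; in the real tutorial this
    would build a prompt and stream a response from GPT-4.1."""
    return f"(stubbed answer for: {question})"


def run_eval() -> None:
    """Register a Braintrust experiment; requires BRAINTRUST_API_KEY."""
    from braintrust import Eval        # braintrust==0.2.6
    from autoevals import Factuality   # autoevals==0.0.130

    Eval(
        "onyx-answer-quality",  # hypothetical project name
        data=lambda: [
            {
                "input": "What does the eval pipeline test?",
                "expected": "Chat answer generation end-to-end.",
            }
        ],
        task=generate_answer,
        scores=[Factuality],  # LLM-based factuality scorer from autoevals
    )
```

Keeping the Braintrust imports inside `run_eval` lets the module load (and the task function be unit-tested) without credentials; results appear as an experiment in the Braintrust UI.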
Confidence score: 2/5
- This PR has several technical issues that could cause runtime failures and needs significant cleanup before merging
- Score reflects problematic code patterns including hardcoded model names, improper error handling, and unused variables that violate coding standards
- Pay close attention to backend/evals/eval_tutorial.py, which contains multiple issues that need to be addressed
Context used:
Rule - When hardcoding a boolean variable to a constant value, remove the variable entirely and clean up all places where it's used rather than just setting it to a constant.
Context - Use explicit type annotations for variables to enhance code clarity, especially when moving type hints around in the code.
2 files reviewed, no comments
Description
Eval Pipeline
How Has This Been Tested?
Tested locally and with test.onyx.app
Will add unit tests once we can dependency-inject some LLM dependencies [for stream_message_objects]
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes; otherwise, resolve the conflicts manually and tag the patches.
Summary by cubic
Add a minimal evaluation pipeline using Braintrust to test our chat answer generation end-to-end with OpenAI GPT-4.1. Includes a tutorial script that builds prompts with AnswerPromptBuilder and returns concatenated streamed output.
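The "concatenated streamed output" step can be sketched as a small helper. The `answer_piece` attribute name mirrors what Onyx's streaming packets are assumed to carry; treat it as an illustrative assumption, not a confirmed API.

```python
from typing import Any, Iterable


def collect_answer(packets: Iterable[Any]) -> str:
    """Join the text fragments of a streamed answer into one string,
    skipping packets that carry no text (e.g. citations or metadata)."""
    pieces = []
    for packet in packets:
        # `answer_piece` is assumed to be the attribute holding the text.
        piece = getattr(packet, "answer_piece", None)
        if piece:
            pieces.append(piece)
    return "".join(pieces)
```

Returning one flat string keeps the eval task's output directly comparable against an `expected` answer by string-based or LLM-based scorers.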
New Features
Dependencies